
PITCH ESTIMATION AND VOICING DETECTION
BASED ON A SINUSOIDAL SPEECH MODEL

Robert J. McAulay and Thomas F. Quatieri

Lincoln Laboratory, MIT


Lexington, MA 02173-9108


ABSTRACT
A new technique for estimating the pitch of a speech waveform
is developed that fits a harmonic set of sine waves to the input data
using a mean-squared-error (MSE) criterion. By exploiting a sinusoidal
model for the input speech waveform, a new pitch estimation criterion is
derived that is inherently unambiguous, uses pitch-adaptive resolution,
uses small-signal suppression to provide enhanced discrimination, and
uses amplitude compression to eliminate the effects of pitch-formant
interaction. The normalized minimum mean-squared-error proves to
be a powerful discriminant for estimating the likelihood that a given
frame of speech is voiced.

INTRODUCTION
An analysis/synthesis system has been developed based on a sinusoidal representation for speech that leads to synthetic speech that is essentially perceptually indistinguishable from the original [1]. The question arises as to whether the parameters of the sinusoidal model, the amplitudes, frequencies and phases of the underlying sine waves, can be coded at low data rates (2400-4800 b/s) and result in a high-quality speech compression system. Although straightforward coding of each of the parameters would lead to the most robust system, the attendant data rate would be too high; hence, speech-specific properties must be introduced to reduce the size of the parameter set to be quantized. One of the fundamental models used in low-rate coding is the assumption that voiced speech is periodic, which suggests that perhaps
the sine-wave frequencies could be coded in terms of a harmonic series.
In this paper an algorithm is derived that fits a harmonic set of
sine waves to the measured set of sine waves for voiced and unvoiced
speech. The accuracy of the harmonic fit becomes an indicator of the
voicing state and is used to define the probability of voicing which can
be used to allow for mixed voiced/unvoiced excitations. The method
has proven to be a powerful pitch estimation technique that has found
wide application beyond the original low-rate coding problem.

PARAMETER ESTIMATION FOR THE HARMONIC SINE-WAVE MODEL

As a first step in the analysis procedure, it is assumed that a frame of the input speech waveform has already been analyzed in terms of its sinusoidal components using the technique described in [1]. The measured speech data, s(n), can therefore be represented as

    s(n) = \sum_{l=1}^{L} A_l \exp[\,j(n\omega_l + \theta_l)\,]    (1)

where {A_l, ω_l, θ_l} represent the amplitudes, frequencies, and phases of the L measured sine waves. The goal is to try to represent this sinusoidal waveform by another for which all of the frequencies are harmonic. This latter waveform can be modeled as

    \hat{s}(n; \omega_0, \phi) = \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \exp[\,j(nk\omega_0 + \phi_k)\,]    (2)

where ω₀ = 2πf₀/f_s is the fundamental frequency, K(ω₀) is the number of harmonics in the speech bandwidth, A(ω) is the vocal tract envelope, φ = {φ₁, φ₂, ..., φ_{K(ω₀)}} represents the phases of the harmonics, and f_s is the rate at which the waveform is sampled. Henceforth, ω₀ will be referred to as the pitch, although during unvoiced speech this terminology is not meaningful in the usual sense. It is desired to estimate the pitch frequency ω₀ and the phases {φ₁, φ₂, ..., φ_{K(ω₀)}} such that ŝ(n) is as close as possible to s(n) according to some meaningful criterion.

A reasonable estimation criterion is to seek the minimum of the mean-squared-error (MSE),

    \epsilon(\omega_0, \phi) = \frac{1}{N+1} \sum_{n=-N/2}^{N/2} \left| s(n) - \hat{s}(n; \omega_0, \phi) \right|^2    (3)

over ω₀ and φ, since this at least insures robustness against additive white Gaussian noise [2]; the error has been normalized by the length of the analysis window, N + 1 samples, so that the first term of its expansion is the signal power. The MSE in Eq. (3) can be expanded as

    \epsilon(\omega_0, \phi) = \frac{1}{N+1} \sum_{n=-N/2}^{N/2} |s(n)|^2
        - \frac{2}{N+1}\,\mathrm{Re} \sum_{n=-N/2}^{N/2} s(n)\,\hat{s}^*(n; \omega_0, \phi)
        + \frac{1}{N+1} \sum_{n=-N/2}^{N/2} |\hat{s}(n; \omega_0, \phi)|^2 .    (4)

If the sinusoidal representation for s(n), Eq. (1), is used in the first term of Eq. (4), then the power in the measured signal can be defined as

    P_s = \frac{1}{N+1} \sum_{n=-N/2}^{N/2} |s(n)|^2 .    (5)

Substituting Eq. (2) in the second term of Eq. (4) leads to the relation

    \sum_{n=-N/2}^{N/2} s(n)\,\hat{s}^*(n; \omega_0, \phi)
        = \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \exp(-j\phi_k) \sum_{n=-N/2}^{N/2} s(n) \exp(-jnk\omega_0) .    (6)

Finally, substituting Eq. (2) in the third term of Eq. (4) leads to the relation

    \frac{1}{N+1} \sum_{n=-N/2}^{N/2} |\hat{s}(n; \omega_0, \phi)|^2 \approx \sum_{k=1}^{K(\omega_0)} A^2(k\omega_0)    (7)

where the approximation is valid provided the analysis window satisfies the condition (N + 1) >> 2π/ω₀, which is more or less assured by making the analysis window 2.5 times the average pitch period.
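Before continuing with the frequency-domain form of the criterion, the harmonic model of Eq. (2) is easy to exercise numerically. The following sketch (Python; not part of the original paper, and all names are illustrative) synthesizes ŝ(n) for an assumed pitch, envelope, and phase set.

```python
import numpy as np

def harmonic_model(n, w0, envelope, phases):
    """Evaluate the harmonic sine-wave model of Eq. (2) at sample indices n.

    w0       : fundamental frequency in radians per sample
    envelope : callable A(w) giving the vocal-tract envelope magnitude
    phases   : harmonic phases phi_k for k = 1, ..., K(w0)
    """
    s_hat = np.zeros(len(n), dtype=complex)
    for k, phi_k in enumerate(phases, start=1):
        s_hat += envelope(k * w0) * np.exp(1j * (n * k * w0 + phi_k))
    return s_hat

# Example: a 200 Hz pitch at fs = 10 kHz with a flat envelope and zero phases.
fs, f0, N = 10000.0, 200.0, 256
w0 = 2.0 * np.pi * f0 / fs
n = np.arange(-N // 2, N // 2 + 1)
K = int(np.pi / w0)                      # harmonics up to half the sampling rate
s_hat = harmonic_model(n, w0, lambda w: 1.0, np.zeros(K))
```

In practice the envelope would be fit to the measured sine-wave amplitudes and the phases taken from the STFT, as developed in the remainder of the paper.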



Letting

    S(\omega) = \frac{1}{N+1} \sum_{n=-N/2}^{N/2} s(n) \exp(-jn\omega)    (8)

denote the short-time Fourier transform (STFT) of the input speech signal, and using this in Eq. (6), the MSE in Eq. (4) becomes

    \epsilon(\omega_0, \phi) = P_s - 2\,\mathrm{Re} \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \exp(-j\phi_k)\, S(k\omega_0) + \sum_{k=1}^{K(\omega_0)} A^2(k\omega_0) .    (9)

Since the phase parameters {φ_k}, k = 1, ..., K(ω₀), only affect the second term in Eq. (9), the MSE will be minimized by choosing

    \hat{\phi}_k = \arg[\,S(k\omega_0)\,]    (10)

and the resulting MSE will be given by

    \epsilon(\omega_0) = P_s - 2 \sum_{k=1}^{K(\omega_0)} A(k\omega_0)\, |S(k\omega_0)| + \sum_{k=1}^{K(\omega_0)} A^2(k\omega_0) .    (11)

The unknown pitch affects only the second and third terms in Eq. (11), and these can be combined by defining

    \rho(\omega_0) = \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \left[\, |S(k\omega_0)| - \tfrac{1}{2} A(k\omega_0) \,\right]    (12)

and the MSE can then be expressed as

    \epsilon(\omega_0) = P_s - 2\rho(\omega_0) .    (13)

Since the first term is a known constant, the minimum mean-squared-error (MMSE) is obtained by maximizing ρ(ω₀) over ω₀.

It is useful to manipulate this metric further by making explicit use of the sinusoidal representation of the input speech waveform. Substituting the representation in Eq. (1) in the STFT defined in Eq. (8) leads to the expression

    S(\omega) = \sum_{l=1}^{L} A_l \exp(j\theta_l)\, \mathrm{sinc}(\omega_l - \omega)    (14)

where

    \mathrm{sinc}(x) = \frac{\sin[(N+1)x/2]}{(N+1)\sin(x/2)} .    (15)

Since the sine waves are well-resolved, the magnitude of the STFT can then be approximated by

    |S(\omega)| \approx \sum_{l=1}^{L} A_l\, D(\omega_l - \omega)    (16)

where D(x) = |sinc(x)|. The MSE criterion then becomes

    \rho(\omega_0) = \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \left[\, \sum_{l=1}^{L} A_l\, D(\omega_l - k\omega_0) - \tfrac{1}{2} A(k\omega_0) \,\right] .    (17)

To gain some insight into the meaning of this criterion, suppose that the input speech is periodic with pitch frequency ω*. Then ω_l = lω*, A_l = A(lω*), and

    \rho(\omega^*) = \tfrac{1}{2} \sum_{k=1}^{K(\omega^*)} A^2(k\omega^*) .    (18)

When ω₀ corresponds to submultiples of the pitch, the first term in (17) remains unchanged, since D(ω_l - kω₀) = 0 at the submultiples; but the second term, because it is an envelope and always non-zero, will increase at the submultiples of ω*. As a consequence

    \rho(\omega^*/m) < \rho(\omega^*), \qquad m = 2, 3, \ldots    (19)

which shows that the MSE criterion leads to unambiguous pitch estimates. This is possibly its most significant attribute, as it has been found through extensive experimentation that the usual problems with pitch period doubling do not occur with this metric. However, the frequency domain implementation can lead to additional processing advantages, the first of which is pitch-adaptive resolution.
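Before turning to those refinements, the basic criterion of Eq. (17) can be evaluated directly from the measured peaks. The sketch below is illustrative Python (not from the paper); `envelope` stands for any estimate of A(ω), such as the SEEVOC envelope described later, and the band limit `w_max` is an assumed parameter.

```python
import numpy as np

def dirichlet_sinc(x, N):
    """Normalized Dirichlet kernel of Eq. (15); equals 1 at x = 0."""
    x = np.asarray(x, dtype=float)
    den = (N + 1) * np.sin(x / 2.0)
    small = np.abs(den) < 1e-12
    num = np.sin((N + 1) * x / 2.0)
    return np.where(small, 1.0, num / np.where(small, 1.0, den))

def rho(w0, amps, freqs, envelope, N, w_max=np.pi):
    """Evaluate rho(w0) of Eq. (17) for a single pitch candidate.

    amps, freqs : measured sine-wave amplitudes A_l and frequencies w_l (rad/sample)
    envelope    : callable giving the amplitude envelope A(w)
    """
    amps = np.asarray(amps, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    K = max(1, int(w_max / w0))                      # K(w0): harmonics in the band
    total = 0.0
    for k in range(1, K + 1):
        Ak = float(envelope(k * w0))
        corr = np.sum(amps * np.abs(dirichlet_sinc(freqs - k * w0, N)))
        total += Ak * (corr - 0.5 * Ak)              # A(k*w0)[sum_l A_l D(.) - A(k*w0)/2]
    return total
```

The pitch estimate is then the candidate ω₀ that maximizes ρ(ω₀), in accordance with Eq. (13).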

PITCH-ADAPTIVE RESOLUTION

In the above formulation it was implied that the analysis window was fixed at N + 1 samples. This would mean that the main lobe of the sinc-function, which measures the distance of the measured sine-wave frequencies from the harmonic candidates (near its peak, sinc(ω_l - kω₀) falls off quadratically in (ω_l - kω₀) for |ω_l - kω₀| small), would be fixed for all pitch candidates. This is contrary to the fact that the ear is perceptually tolerant of larger errors in the pitch at high pitch frequencies than at lower pitch frequencies. Moreover, the sinc-function distance measure of the error is meaningful only over each harmonic lobe. These effects can be accounted for by defining the distance function D(x) at the k-th harmonic lobe to be a pitch-adaptive version of the sinc-function confined to that lobe (Eq. (20)), and to be zero elsewhere. In this way the resolution becomes very sharp at low pitch values and, in contrast, becomes quite broad at high values of the pitch. It is this expression which is used in (17) to compute the first revised mean-squared-error.

ENHANCED DISCRIMINATION

The MSE criterion is closely related to the design of a Gaussian classifier for which the classes, the pitch candidates, are assumed to be independent. It is desirable that the classification algorithm not only detect the correct class with high probability, but also suppress the likelihood that any other class might be detected. This feature, which in a neural net classifier is known as negative reinforcement [3], can be incorporated into the MSE pitch estimation algorithm by noting that if ω₀ were the true pitch, then there would be at most one measured sine wave in each harmonic lobe tuned to ω₀. Therefore, if there are more, then only the one that contributes most to the MSE should be retained. Since the lobes are determined by the pitch-adaptive sinc-function in (20), and since each lobe spans one harmonic interval defined by the set

    L(k\omega_0) = \{\, \omega : k\omega_0 - \omega_0/2 \le \omega < k\omega_0 + \omega_0/2 \,\},

discrimination will be enhanced by allowing only the largest weighted sine wave in each harmonic lobe. The second revision to the MSE pitch estimation criterion becomes

    \rho(\omega_0) = \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \left[\, \max_{\omega_l \in L(k\omega_0)} A_l\, D(\omega_l - k\omega_0) - \tfrac{1}{2} A(k\omega_0) \,\right] .    (21)

In addition to providing greater robustness against additive noise (since the small peaks due to noise are ignored), the enhanced MSE criterion insures that speech of low pitch will be less likely to be estimated as a high pitch. Moreover, if the above implementation is thought of as a form of small-signal suppression, and if the harmonic lobe structure is thought of as an auditory critical-band filter, then it is possible to speculate that enhanced discrimination is not unlike the effect of auditory masking of small tones by nearby large tones [4].
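A corresponding sketch of the lobe-limited criterion in Eq. (21): within each harmonic interval L(kω₀) only the largest weighted peak contributes. This is again illustrative Python under the same assumed peak and envelope conventions as the earlier sketch; the kernel is repeated inside the function so that the block is self-contained.

```python
import numpy as np

def rho_enhanced(w0, amps, freqs, envelope, N, w_max=np.pi):
    """Evaluate the enhanced criterion of Eq. (21) for a single pitch candidate."""
    def D(x):
        # |sinc(x)| with the Dirichlet-kernel definition of Eq. (15)
        den = (N + 1) * np.sin(x / 2.0)
        small = np.abs(den) < 1e-12
        val = np.sin((N + 1) * x / 2.0) / np.where(small, 1.0, den)
        return np.abs(np.where(small, 1.0, val))

    amps = np.asarray(amps, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    K = max(1, int(w_max / w0))
    total = 0.0
    for k in range(1, K + 1):
        Ak = float(envelope(k * w0))
        in_lobe = np.abs(freqs - k * w0) < w0 / 2.0   # peaks inside L(k*w0)
        if np.any(in_lobe):
            best = np.max(amps[in_lobe] * D(freqs[in_lobe] - k * w0))
        else:
            best = 0.0                                # empty lobe contributes no correlation
        total += Ak * (best - 0.5 * Ak)
    return total
```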


THE FORMANT INTERACTION PROBLEM


One of the more important pitch estimation techniques in current use is based on the correlation function. In some respects it is the time-domain dual of the correlation implicit in the first term in (12). One problem with the time-domain correlation technique is that it is inherently ambiguous, which requires the use of some type of frame-to-frame pitch tracking. Another problem arises as a result of the interaction between the pitch and the first formant. If the formant bandwidth is narrow relative to the harmonic spacing, the correlation function reflects the formant frequency rather than the underlying pitch. Nonlinear time-domain processing techniques using various types of center-clipping have been developed to eliminate the problem [5].
The same effect manifests itself in the frequency domain, since the sine-wave amplitude near the formant frequency will tend to dominate the MSE criterion. This effect can be eliminated simply by reducing the dynamic range of all of the sine-wave amplitudes and, in turn, of the amplitude envelope. One way to do this is to replace the measured sine-wave amplitudes by compressed versions, raised to a power γ and scaled in terms of the largest amplitude A_max = max{A_l : l = 1, ..., L}. Since the MSE criterion leads to maximal robustness against additive white Gaussian noise, it was desirable to keep γ as close to unity as possible, introducing just enough amplitude compression to eliminate the formant interaction problem. Too much compression causes the low-level peaks due to noise to distort the MSE criterion. Ultimately, the compression factor was chosen to be γ = 0.5, having been determined experimentally using a real-time system to process approximately two hours of speech for a variety of speech and noise conditions.
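As a sketch only (the paper's explicit compression formula is not reproduced above, so the normalization used here is an assumption consistent with the text), the γ = 0.5 compression can be applied to the measured peaks before the envelope and the criterion are computed:

```python
import numpy as np

def compress_amplitudes(amps, gamma=0.5):
    """Reduce the dynamic range of the measured sine-wave amplitudes.

    Normalizing by the largest peak A_max and raising to the power gamma is
    one plausible form; gamma = 0.5 is the value reported in the text.
    """
    amps = np.asarray(amps, dtype=float)
    a_max = np.max(amps)
    if a_max <= 0.0:
        return np.zeros_like(amps)
    return (amps / a_max) ** gamma

# A dominant first-formant peak is pulled back toward the smaller peaks.
print(compress_amplitudes([1000.0, 80.0, 60.0, 40.0]))
```

Because ρ(ω₀) scales quadratically with a common amplitude scale factor, the overall normalization does not affect which pitch candidate maximizes it.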

SINE-WAVE AMPLITUDE ENVELOPE ESTIMATION
It has been shown that if the envelope of the sine-wave amplitudes
is known, then the MSE criterion can lead to unambiguous estimates
of the pitch. While a number of methods might be used for estimating
the envelope using linear prediction or cepstral estimation techniques,
for example, it was desirable to use a method that led to an envelope
that passed through the measured sine-wave amplitudes. Such a technique has already been developed in the Spectral Envelope Estimation
Vocoder (SEEVOC) [6]. The method depends on having an estimate
of the average pitch, denoted here by ω̂₀. The first step is to search for the largest sine-wave amplitude in the frequency range [ω̂₀/2, 3ω̂₀/2]. Having found the amplitude and frequency of that peak, denoted here by (A₁, ω₁), the interval [ω₁ + ω̂₀/2, ω₁ + 3ω̂₀/2] is then searched for its largest peak, (A₂, ω₂). The process is continued throughout the speech band. If no peak is found in a search bin, then the largest end-point of the STFT magnitude in that bin is used and placed at a frequency at the bin center. In the original SEEVOC application the goal was to obtain an estimate of the vocal tract envelope for use in a low-rate vocoder. This was done by linearly interpolating between the successive log-amplitudes at the peaks determined by the above search procedure. In the application to MSE pitch estimation, however, the purpose of the envelope is mainly to eliminate pitch ambiguities. Since the linearly interpolated envelope could affect the fine structure of the MSE criterion through its interaction with the measured peaks in the correlation operation, better performance was obtained by using piecewise-constant interpolation between the SEEVOC peaks.
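The sketch below illustrates the peak search and the piecewise-constant envelope described above. It is a simplified illustration (Python, names assumed), not the reference SEEVOC implementation; in particular the handling of empty search bins is only approximated.

```python
import numpy as np

def seevoc_peaks(amps, freqs, w0_avg, w_max=np.pi):
    """SEEVOC-style peak picking: one peak per pitch-wide search bin.

    amps, freqs : measured sine-wave amplitudes and frequencies (rad/sample)
    w0_avg      : estimate of the average pitch (rad/sample)
    """
    amps = np.asarray(amps, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    peaks_a, peaks_w = [], []
    lo, hi = 0.5 * w0_avg, 1.5 * w0_avg
    while lo < w_max:
        in_bin = (freqs >= lo) & (freqs < hi)
        if np.any(in_bin):
            i = int(np.argmax(np.where(in_bin, amps, -np.inf)))
            a_pk, w_pk = float(amps[i]), float(freqs[i])
        else:
            # No peak found: place a small value at the bin center
            # (the paper uses the larger STFT end-point of the bin).
            a_pk, w_pk = 0.0, 0.5 * (lo + hi)
        peaks_a.append(a_pk)
        peaks_w.append(w_pk)
        lo, hi = w_pk + 0.5 * w0_avg, w_pk + 1.5 * w0_avg
    return np.array(peaks_a), np.array(peaks_w)

def piecewise_constant_envelope(peaks_a, peaks_w):
    """Return a callable A(w) that holds each peak amplitude until the next peak."""
    def A(w):
        idx = np.clip(np.searchsorted(peaks_w, w, side="right") - 1,
                      0, len(peaks_a) - 1)
        return peaks_a[idx]
    return A
```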

COARSE PITCH ESTIMATION

The MSE pitch extractor is predicated on the assumption that the input speech waveform has been represented in terms of the sinusoidal model. This implicitly assumes that the analysis has been performed using a Hamming window approximately 2.5 times the average pitch period. Moreover, the SEEVOC technique also assumes that an estimate of the average pitch is available. It seems, therefore, that the pitch has to be known in order to estimate the average pitch, in order to estimate the pitch. This circular dilemma can be broken by using some other method to estimate the average pitch based on a fixed window. Since only an average pitch value is needed, the estimation technique does not have to be accurate on every frame; hence, any of the well-known techniques can be used. In a future paper, a method using the sinusoidal model and the MSE criterion will be described that has the advantages of the present technique but operates on a fixed analysis window and requires no amplitude estimate. It is not as reliable as the present method, but it is good enough to estimate the average pitch.
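Purely as an illustration of this bootstrap step (the paper does not specify which of the well-known techniques it uses), a coarse pitch can be taken from a fixed-length frame with a standard autocorrelation search and then averaged over recent voiced frames:

```python
import numpy as np

def coarse_pitch_autocorr(frame, fs, f_lo=50.0, f_hi=400.0):
    """Coarse pitch estimate (Hz) from one fixed-length frame via autocorrelation.

    The frame must be longer than fs / f_lo samples; this is one of the
    'well-known techniques' alluded to in the text, not the paper's method.
    """
    x = np.asarray(frame, dtype=float)
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_lo, lag_hi = int(fs / f_hi), int(fs / f_lo)
    lag = lag_lo + int(np.argmax(r[lag_lo:lag_hi + 1]))
    return fs / lag
```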

VOICING DETECTION
In the context of the sinusoidal model the degree to which a given frame of speech is voiced is determined by the degree to which the harmonic model fits the original sine-wave data. The accuracy of the harmonic fit can be related, in turn, to the signal-to-noise ratio (SNR), defined as the ratio of the signal power to the minimum mean-squared-error of the harmonic fit. From (13) it follows that the SNR can be computed directly from P_s and the maximized criterion ρ(ω̂₀), where now the input power P_s is computed from the compressed sine-wave amplitudes. If the SNR is large, then the MSE is small and the harmonic fit is very good, which indicates that the input speech is most likely voiced. For small SNR, on the other hand, the MSE is large and the harmonic fit is quite poor, which indicates that the input speech is more likely to be unvoiced. Therefore, the degree of voicing is functionally dependent on the SNR. Although the determination of the exact functional form is difficult, one that has proven useful in several speech applications is the following:

    P_v = \begin{cases} 1, & \mathrm{SNR} > 10\ \mathrm{dB} \\ \tfrac{1}{6}(\mathrm{SNR} - 4), & 4\ \mathrm{dB} \le \mathrm{SNR} \le 10\ \mathrm{dB} \\ 0, & \mathrm{SNR} < 4\ \mathrm{dB} \end{cases}    (26)

where P_v represents the probability that the speech is voiced and the SNR is expressed in dB. The voicing probability concept has proven useful in a number of speech applications [7] and, in particular, has been used in the Sinusoidal Transform Coder [8] to provide a mixed voicing excitation.
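A sketch of the voicing measure as reconstructed above: the SNR is formed from P_s and the maximized criterion via Eq. (13) (the paper's exact SNR expression is paraphrased rather than reproduced, so this ratio is an assumption), and Eq. (26) then maps it to a voicing probability.

```python
import numpy as np

def snr_db(P_s, rho_max):
    """SNR implied by Eq. (13): signal power over the minimum MSE, in dB."""
    mse = max(P_s - 2.0 * rho_max, 1e-12)     # epsilon(w0) = P_s - 2*rho(w0)
    return 10.0 * np.log10(P_s / mse)

def voicing_probability(snr):
    """Piecewise-linear voicing probability of Eq. (26); snr in dB."""
    if snr > 10.0:
        return 1.0
    if snr < 4.0:
        return 0.0
    return (snr - 4.0) / 6.0
```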


IMPLEMENTATION
In one implementation of the MSE pitch extractor the speech was sampled at 10 kHz and Fourier analyzed using a 512-point FFT. The sine-wave amplitudes and frequencies were determined over a 1000 Hz bandwidth. In Figure 1(a), the measured amplitudes and frequencies are shown along with the piecewise-constant SEEVOC envelope. Square-root compression has been applied to the amplitude data. Figure 1(b) is a plot of the first term in (22) over a pitch range from 38 Hz to 400 Hz, and the inherent ambiguity of the correlator is apparent. It should be noted that most of the time the peak at the correct pitch is largest, but during steady vowels the ambiguous behavior illustrated in the figure commonly occurs. Figure 1(c) is a plot of the overall MSE criterion, and the manner in which the ambiguities are eliminated is clearly demonstrated. Figure 1(d) is an illustration of the voicing probability as a function of the SNR; for this example the SNR is about 20 dB, indicating that the speech is clearly voiced.
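For completeness, one way to organize the frame-level search over the 38-400 Hz range mentioned above is sketched here. The criterion function is passed in (for example, the `rho` or `rho_enhanced` sketches given earlier), and the 1 Hz candidate grid is an assumption rather than a detail taken from the paper.

```python
import numpy as np

def estimate_pitch(amps, freqs, envelope, rho_fn, N, fs,
                   f_lo=38.0, f_hi=400.0, df=1.0):
    """Scan a grid of pitch candidates and return (w0_hat, rho_max).

    rho_fn(w0, amps, freqs, envelope, N) is any MSE-based criterion of the
    form derived above; w0_hat is returned in radians per sample.
    """
    candidates = np.arange(f_lo, f_hi + df, df) * 2.0 * np.pi / fs
    scores = np.array([rho_fn(w0, amps, freqs, envelope, N) for w0 in candidates])
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])
```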



HARMONIC SINE-WAVE RECONSTRUCTION


Validating the performance of a pitch extractor can be a time-consuming and laborious procedure, since it requires a comparison with hand-labeled data. The approach used in the present study was to reconstruct the speech using the harmonic sine-wave model and to listen for pitch errors. The procedure is not quite as straightforward as Eq. (2) indicates, however, because during unvoiced speech meaningless pitch estimates are made, which can lead to perceptual artifacts whenever the pitch estimate is greater than about 150 Hz. This is due to the fact that, in these cases, there are too few sine waves to adequately synthesize a noiselike waveform. This problem has been eliminated by defaulting to a fixed low pitch (≈ 100 Hz) during unvoiced speech whenever the pitch estimate exceeds 100 Hz. The exact procedure for doing this is to first define a voicing-dependent cutoff frequency, ω_c, as

    \omega_c(P_v) = \pi P_v    (27)

which is constrained to be no smaller than 2π(1000 Hz/f_s). If the actual pitch estimate is ω₀, then the sine-wave frequencies used in the reconstruction are

    \omega_k = \begin{cases} k\omega_0, & k \le \hat{k} \\ \hat{k}\omega_0 + (k - \hat{k})\,\omega_u, & k > \hat{k} \end{cases}    (28)

where k̂ is the largest value of k for which k̂ω₀ ≤ ω_c(P_v), and where ω_u, the unvoiced pitch, corresponds to 100 Hz (i.e., ω_u = 2π(100/f_s)). Note that if ω₀ < ω_u, then ω_k = kω₀ for all k. The harmonic reconstruction then becomes

    \hat{s}(n; \omega_0) = \sum_{k=1}^{K} A(\omega_k) \exp[\,j(n\omega_k + \psi_k)\,]    (29)



where ψ_k is the phase of the STFT at frequency ω_k. Strictly speaking, this procedure is harmonic only during strongly-voiced speech, since if the speech is a voiced/unvoiced mixture the frequencies above the cutoff, although equally spaced by ω_u, are not themselves multiples of a fundamental pitch.
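A sketch of the frequency assignment in Eqs. (27) and (28), as reconstructed above (the form of Eq. (28) is itself an interpretation of the surviving text): harmonics of the estimated pitch are used up to the voicing-dependent cutoff, and above it the frequencies are spaced at the unvoiced pitch ω_u.

```python
import numpy as np

def reconstruction_frequencies(w0, P_v, fs, n_freqs):
    """Sine-wave frequencies for the harmonic reconstruction, per Eqs. (27)-(28).

    w0      : estimated pitch in radians per sample
    P_v     : voicing probability from Eq. (26)
    n_freqs : number of sine waves used to fill the band
    """
    w_u = 2.0 * np.pi * 100.0 / fs                        # unvoiced pitch (100 Hz)
    w_c = max(np.pi * P_v, 2.0 * np.pi * 1000.0 / fs)     # cutoff of Eq. (27), floored at 1 kHz
    if w0 < w_u:
        return w0 * np.arange(1, n_freqs + 1)             # fully harmonic case
    k_hat = max(1, int(w_c / w0))                         # largest k with k*w0 <= w_c
    freqs = [k * w0 if k <= k_hat else k_hat * w0 + (k - k_hat) * w_u
             for k in range(1, n_freqs + 1)]
    return np.array(freqs)
```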


The synthetic speech produced by this model is of very high quality, almost perceptually equivalent to the original. Not only does this
validate the performance of the MSE pitch extractor, but it also shows
that if the amplitudes and phases of the harmonic representation could
be efficiently coded, then only the pitch and voicing are needed to code
the information in the sine-wave frequencies.


CONCLUSIONS

A new technique for estimating the pitch of a speech waveform has been developed that fits a harmonic set of sine waves to the input data using a mean-squared-error criterion. By exploiting a sinusoidal model for the input speech waveform, a new criterion was derived that was inherently unambiguous, had pitch-adaptive resolution, used small-signal suppression to provide enhanced discrimination, and used amplitude compression to eliminate the effects of pitch-formant interaction. It was found that the normalized MSE proved to be a powerful discriminant for estimating the likelihood that a given frame of speech is voiced. The new pitch estimator/voicing detector has proven to be useful for low-rate speech coding, speech enhancement, and time- and pitch-scale modification of speech.


This work was sponsored by the Department of the Air Force.

References

[1] R.J. McAulay and T.F. Quatieri, "Speech Analysis/Synthesis Based on a Sinusoidal Representation," IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-34, No. 4, pp. 744-754, August 1986.
[2] H. Van Trees, Detection, Estimation and Modulation Theory, Part I, Wiley, New York, 1968.
[3] R.P. Lippmann, "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine, pp. 4-22, April 1987.
[4] H. Duifhuis, L.F. Willems and R.J. Sluyter, "Measurement of Pitch in Speech: An Implementation of Goldstein's Theory of Pitch Perception," J. Acoust. Soc. Am., Vol. 71, No. 6, pp. 1568-1580, June 1982.
[5] L. Rabiner, "On the Use of Autocorrelation Analysis for Pitch Detection," IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-25, No. 1, pp. 24-33, February 1977.
[6] D.B. Paul, "The Spectral Envelope Estimation Vocoder," IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-29, pp. 786-794, 1981.
[7] T.F. Quatieri and R.J. McAulay, "Phase Coherence in Speech Reconstruction for Enhancement and Coding Applications," Proc. IEEE ICASSP '89, Glasgow, Scotland, pp. 207-209, May 1989.
[8] R.J. McAulay, T.M. Parks, T.F. Quatieri, and M. Sabin, "Sine-wave Amplitude Coding at Low Data Rates," IEEE Workshop on Speech Coding, Vancouver, B.C., Canada, September 1989.
[9] M.R. Schroeder, "Period Histogram and Product Spectrum: New Methods for Fundamental-Frequency Measurement," J. Acoust. Soc. Am., Vol. 43, No. 4, pp. 829-834, 1968.
[10] R. Linggard and W. Millar, "Pitch Detection Using Harmonic Histograms," Speech Communication, Vol. 1, pp. 113-124, 1982.
[11] S. Seneff, "Real-Time Harmonic Pitch Detector," IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-26, No. 4, pp. 358-365, August 1978.

Figure 1. (a) Amplitude envelopes; (b) correlator output; (c) mean-squared-error; (d) voicing probability.


