PITCH ESTIMATION AND VOICING DETECTION
BASED ON A SINUSOIDAL SPEECH MODEL
ABSTRACT
A new technique for estimating the pitch of a speech waveform
is developed that fits a harmonic set of sine waves to the input data
using a mean-squared-error (MSE) criterion. By exploiting a sinusoidal
model for the input speech waveform, a new pitch estimation criterion is
derived that is inherently unambiguous, uses pitch-adaptive resolution,
uses small-signal suppression to provide enhanced discrimination, and
uses amplitude compression to eliminate the effects of pitch-formant
interaction. The normalized minimum mean-squared-error proves to
be a powerful discriminant for estimating the likelihood that a given
frame of speech is voiced.
INTRODUCTION
An analysis/synthesis system has been developed based on a sinusoidal representation for speech that leads to synthetic speech that
is essentially perceptually indistinguishable from the original [1]. The
question arises as to whether the parameters of the sinusoidal model,
the amplitudes, frequencies and phases of the underlying sine waves,
can be coded at low data rates (2400-4800 b/s) and result in a high-quality speech compression system. Although straightforward coding
of each of the parameters would lead to the most robust system, the
attendant data rate would be too high; hence speech-specific properties
must be introduced to reduce the size of the parameter set to be quantized. One of the fundamental models used in low-rate coding is the
assumption that voiced speech is periodic, which suggests that perhaps
the sine-wave frequencies could be coded in terms of a harmonic series.
In this paper an algorithm is derived that fits a harmonic set of
sine waves to the measured set of sine waves for voiced and unvoiced
speech. The accuracy of the harmonic fit becomes an indicator of the
voicing state and is used to define the probability of voicing which can
be used to allow for mixed voiced/unvoiced excitations. The method
has proven to be a powerful pitch estimation technique that has found
wide application beyond the original low-rate coding problem.
$$\hat{s}(n; \omega_0, \phi) = \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \exp[j(nk\omega_0 + \phi_k)] \qquad (2)$$
over $\omega_0$ and $\phi$, since this at least ensures robustness against additive
white Gaussian noise [2]. The MSE in Eq. (3) can be expanded as
(4)
If the sinusoidal representation for s(n), Eq. (1), is used in the first
term of Eq. (4), then the power in the measured signal can be defined
Substituting Eq. (2) in the second term of Eq. (4) leads to the relation
$$\sum_{n=-N/2}^{N/2} s(n)\,\hat{s}^*(n; \omega_0, \phi) = \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \exp(-j\phi_k) \sum_{n=-N/2}^{N/2} s(n) \exp(-jnk\omega_0) \qquad (6)$$
As a first step in the analysis procedure, it is assumed that a frame
of the input speech waveform has already been analyzed in terms of its
sinusoidal components using the technique described in [1]. The
measured speech data, s(n), can therefore be represented as
$$s(n) = \sum_{l=1}^{L} A_l \exp[j(n\omega_l + \theta_l)] \qquad (1)$$
where $\{A_l, \omega_l, \theta_l\}$ represent the amplitudes, frequencies, and phases
of the L measured sine waves. The goal is to try to represent this

Finally, substituting Eq. (2) in the third term of Eq. (4) leads to the
relation
CH2847-2/90/0000-0249 $1.00 © 1990 IEEE
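$$\epsilon(\omega_0, \phi) = P_s - 2\,\mathrm{Re} \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \exp(-j\phi_k)\, S(k\omega_0) + \sum_{k=1}^{K(\omega_0)} A^2(k\omega_0) \qquad (9)$$
where $P_s$ denotes the power in the measured signal.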
Since the phase parameters $\{\phi_k\}_{k=1}^{K(\omega_0)}$ only affect the second term in
Eq. (9), the MSE will be minimized by choosing
$$\hat{\phi}_k = \arg[S(k\omega_0)] \qquad (10)$$
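The phase choice can be checked with one line of algebra. The following is a sketch that assumes the amplitude samples $A(k\omega_0)$ are real and non-negative (an assumption not stated explicitly in the surviving text) and writes $S(k\omega_0)$ in polar form:

```latex
\mathrm{Re}\left\{ A(k\omega_0)\, e^{-j\phi_k}\, S(k\omega_0) \right\}
  = A(k\omega_0)\, |S(k\omega_0)| \cos\!\left( \arg[S(k\omega_0)] - \phi_k \right)
  \le A(k\omega_0)\, |S(k\omega_0)|
```

Equality holds when $\phi_k = \arg[S(k\omega_0)]$. Because this term enters the MSE with a negative sign, making each summand as large as possible makes the MSE as small as possible.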
PITCH-ADAPTIVE RESOLUTION
In the above formulation it was implied that the analysis window
was fixed at N + 1 samples. This would mean that the main lobe of
the sinc function, which measures the distance of the measured sine-wave frequencies from the harmonic candidates (i.e., $\mathrm{sinc}(\omega_l - k\omega_0) \approx 1 - (\omega_l - k\omega_0)^2$
for $|\omega_l - k\omega_0|$ small), would be fixed for all pitch candidates. This is contrary to the fact that the ear is perceptually tolerant
to larger errors in the pitch at high pitch frequencies than at lower
pitch frequencies. Moreover, the sinc-function distance measure of the
error is meaningful only over each harmonic lobe. These effects can
be accounted for by defining the distance function D(z) at the k'th
harmonic lobe to be
$$\rho(\omega^*/m) < \rho(\omega^*), \qquad m = 2, 3, \ldots \qquad (19)$$
which shows that the MSE criterion leads to unambiguous pitch estimates. This is possibly its most significant attribute, as it has been
found through extensive experimentation that the usual problems with
pitch period doubling do not occur with this metric. However, the
frequency domain implementation can lead to additional processing
advantages, the first of which is pitch-adaptive resolution.
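The unambiguity claim can be illustrated with an idealized worked case. This is a sketch under stated assumptions, not the paper's derivation: the criterion is taken to be $\rho(\omega_0) = \sum_k A(k\omega_0)[|S(k\omega_0)| - \tfrac{1}{2}A(k\omega_0)]$, the measured sine waves lie exactly at $k\omega^*$ for $k = 1, \ldots, K$, the envelope is flat with value $A$, and $|S(\omega)|$ equals $A$ at the measured frequencies and is approximately zero midway between them.

```latex
\rho(\omega^*) = \sum_{k=1}^{K} A\left(A - \tfrac{1}{2}A\right) = \tfrac{1}{2} K A^2

\rho(\omega^*/2) = \underbrace{\sum_{k\ \mathrm{even}} A\left(A - \tfrac{1}{2}A\right)}_{K\ \mathrm{terms}}
                 + \underbrace{\sum_{k\ \mathrm{odd}} A\left(0 - \tfrac{1}{2}A\right)}_{K\ \mathrm{terms}}
                 = \tfrac{1}{2} K A^2 - \tfrac{1}{2} K A^2 = 0 < \rho(\omega^*)

\sum_{k} A(k\omega_0)\,|S(k\omega_0)| = K A^2
  \quad \text{at both } \omega_0 = \omega^* \text{ and } \omega_0 = \omega^*/2
```

Under these assumptions the correlation term alone cannot separate the true pitch from its submultiple, while the envelope-energy term penalizes the extra harmonics that the submultiple introduces.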
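$$\epsilon(\omega_0) = P_s - 2 \sum_{k=1}^{K(\omega_0)} A(k\omega_0)\,|S(k\omega_0)| + \sum_{k=1}^{K(\omega_0)} A^2(k\omega_0) \qquad (11)$$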
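The unknown pitch affects only the second and third terms in Eq. (11),
and these can be combined by defining
$$\rho(\omega_0) = \sum_{k=1}^{K(\omega_0)} A(k\omega_0) \left[ |S(k\omega_0)| - \tfrac{1}{2} A(k\omega_0) \right]$$
Since the first term is a known constant, the minimum mean-squared error (MMSE) is obtained by maximizing $\rho(\omega_0)$ over $\omega_0$.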
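The search over pitch candidates can be sketched in a few lines. This is a minimal illustration under several assumptions, not the authors' implementation: the criterion is taken as $\rho(\omega_0) = \sum_k A(k\omega_0)[|S(k\omega_0)| - \tfrac{1}{2}A(k\omega_0)]$; $|S(k\omega_0)|$ is approximated from the measured sine waves with a sinc main lobe stretched to one harmonic lobe; frequencies are in Hz; and the names `lobe_distance`, `rho_terms`, `rho`, `estimate_pitch`, and the flat test envelope are all hypothetical.

```python
import numpy as np

def lobe_distance(delta, f0):
    """Assumed distance function: a sinc main lobe stretched to fill one
    harmonic lobe (|delta| <= f0/2) and set to zero outside it."""
    x = np.asarray(delta) / f0
    return np.where(np.abs(x) <= 0.5, np.sinc(2.0 * x), 0.0)

def rho_terms(f0, freqs, amps, envelope, bandwidth):
    """Return (correlation term, envelope-energy term) for candidate f0 in Hz."""
    k = np.arange(1, int(bandwidth // f0) + 1)   # harmonic numbers inside the band
    harmonics = k * f0
    a = envelope(harmonics)                      # envelope samples A(k f0)
    # |S(k f0)| approximated by the measured sine waves falling in each lobe
    s_mag = lobe_distance(freqs[None, :] - harmonics[:, None], f0) @ amps
    return float(np.sum(a * s_mag)), float(np.sum(a ** 2))

def rho(f0, freqs, amps, envelope, bandwidth):
    """Criterion to maximize: correlation minus half the envelope energy."""
    correlation, energy = rho_terms(f0, freqs, amps, envelope, bandwidth)
    return correlation - 0.5 * energy

def estimate_pitch(candidates, freqs, amps, envelope, bandwidth):
    """Pick the candidate pitch (Hz) that maximizes rho."""
    scores = [rho(f0, freqs, amps, envelope, bandwidth) for f0 in candidates]
    return float(candidates[int(np.argmax(scores))])

# Hypothetical test frame: five unit-amplitude sine waves at harmonics of 200 Hz
freqs = np.array([200.0, 400.0, 600.0, 800.0, 1000.0])
amps = np.ones(5)
flat = lambda f: np.ones_like(f)                 # flat envelope, illustration only
candidates = np.arange(50.0, 401.0, 1.0)
pitch = estimate_pitch(candidates, freqs, amps, flat, 1000.0)
```

In this toy frame the correlation term is identical at 100 Hz and 200 Hz, so it is the energy term that separates the two candidates.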
ENHANCED DISCRIMINATION
$$S(\omega) = \sum_{l=1}^{L} A_l \exp(j\theta_l)\, \mathrm{sinc}(\omega_l - \omega) \qquad (14)$$
where
Since the sine waves are well-resolved, the magnitude of the STFT can
then be approximated by
$$L(k\omega_0) = \{\omega : k\omega_0 - \omega_0/2 \le \omega < k\omega_0 + \omega_0/2\} \qquad (21)$$
model. This implicitly assumes that the analysis has been performed
using a Hamming window approximately 2.5 times the average pitch period.
Moreover, the SEEVOC technique also assumes that an estimate of the
average pitch is available. It seems, therefore, that the pitch has to be
known in order to estimate the average pitch, in order to estimate the
pitch. This circular dilemma can be broken by using some other method
to estimate the average pitch based on a fixed window. Since only an
average pitch value is needed, the estimation technique does not have
to be accurate on every frame; hence, any of the well-known techniques
can be used. In a future paper, a method using the sinusoidal model
and the MSE criterion will be described that has the advantages of
the present technique but which operates on a fixed analysis window
and requires no amplitude estimate. It is not as reliable as the present
method, but it is good enough to estimate the average pitch.
VOICING DETECTION
In the context of the sinusoidal model the degree to which a given
frame of speech is voiced is determined by the degree to which the
harmonic model fits the original sine-wave data. The accuracy of the
harmonic fit can be related, in turn, to the signal-to-noise ratio (SNR)
defined by
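$$P_v = \begin{cases} 1, & \mathrm{SNR} > 10\ \mathrm{dB} \\ \tfrac{1}{6}(\mathrm{SNR} - 4), & 4\ \mathrm{dB} \le \mathrm{SNR} \le 10\ \mathrm{dB} \\ 0, & \mathrm{SNR} < 4\ \mathrm{dB} \end{cases} \qquad (26)$$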
where $P_v$ represents the probability that speech is voiced, and the SNR
is expressed in dB. The voicing probability concept has proven useful
in a number of speech applications [7] and, in particular, has been
used in the Sinusoidal Transform Coder [8] to provide a mixed voicing
excitation.
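The SNR-to-probability mapping is simple enough to sketch directly. The break points at 4 dB and 10 dB come from the text; the slope of 1/6 between them is an assumption, chosen so that the mapping is continuous at both break points. The function name is hypothetical.

```python
def voicing_probability(snr_db: float) -> float:
    """Piecewise-linear voicing probability as a function of SNR in dB."""
    if snr_db > 10.0:
        return 1.0               # clearly voiced
    if snr_db < 4.0:
        return 0.0               # clearly unvoiced
    # Assumed linear ramp between the two break points
    return (snr_db - 4.0) / 6.0
```

A frame with an SNR of 7 dB, for example, would be assigned a voicing probability of 0.5, which is the kind of intermediate value a mixed voiced/unvoiced excitation can use.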
+ F]
IMPLEMENTATION
In one implementation of the MSE pitch extractor, the speech was
sampled at 10 kHz and Fourier analyzed using a 512-point FFT. The
sine-wave amplitudes and frequencies were determined over a 1000 Hz
bandwidth. In Figure 1(a), the measured amplitudes and frequencies are shown along with the piecewise-constant SEEVOC envelope.
Square-root compression has been applied to the amplitude data. Figure 1(b) is a plot of the first term in (22) over a pitch range from 38 Hz
to 400 Hz, and the inherent ambiguity of the correlator is apparent. It
should be noted that most of the time the peak at the correct pitch
is largest, but during steady vowels the ambiguous behavior illustrated
in the figure commonly occurs. Figure 1(c) is a plot of the overall
MSE criterion, and the manner in which the ambiguities are eliminated
is clearly demonstrated. Figure 1(d) is an illustration of the voicing
probability as a function of the SNR, and for this example the SNR is
about 20 dB, indicating that the speech is clearly voiced.
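The front end described above can be sketched as follows. The sampling rate, FFT size, bandwidth, and square-root compression come from the text; everything else is an assumption made for illustration: sine waves are located by plain local-maximum picking on the FFT magnitude (the analysis technique of [1] is not reproduced here), a relative threshold discards window side lobes, and the function name and parameters are hypothetical.

```python
import numpy as np

def pick_sine_waves(frame, fs=10_000.0, nfft=512, bandwidth=1000.0,
                    rel_threshold=0.1):
    """Return (frequencies in Hz, square-root-compressed amplitudes) of
    spectral peaks below `bandwidth`, using simple local-maximum picking."""
    window = np.hamming(len(frame))
    spectrum = np.fft.rfft(frame * window, nfft)
    # Scale so that a real sinusoid of amplitude A peaks near A
    mag = 2.0 * np.abs(spectrum) / window.sum()
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    i = np.arange(1, len(mag) - 1)               # interior bins only
    is_peak = (mag[i] > mag[i - 1]) & (mag[i] >= mag[i + 1])
    strong = mag[i] >= rel_threshold * mag.max() # assumed side-lobe rejection
    in_band = freqs[i] <= bandwidth
    idx = i[is_peak & strong & in_band]
    return freqs[idx], np.sqrt(mag[idx])         # square-root compression
```

Because the peaks are taken at FFT bin centers, frequency resolution in this sketch is limited to fs/nfft, about 19.5 Hz for the settings above.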
= TP
(27)
$= 2\pi(100/f_s)$).
$\omega_k = k\omega_0$ for all k. The harmonic recon-
$$\hat{s}(n; \omega_0) = \sum_{k=1} A(\omega_k) \exp[j(n\omega_k + \phi_k)] \qquad (29)$$
[Figure: waveform panel; horizontal axis TIME (ms)]
The synthetic speech produced by this model is of very high quality, almost perceptually equivalent to the original. Not only does this
validate the performance of the MSE pitch extractor, but it also shows
that if the amplitudes and phases of the harmonic representation could
be efficiently coded, then only the pitch and voicing are needed to code
the information in the sine-wave frequencies.
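A harmonic reconstruction of this kind can be sketched as a sum of sinusoids at multiples of the pitch. This is an illustrative sketch only: it assumes the form $\sum_k A(\omega_k)\exp[j(n\omega_k + \phi_k)]$ with $\omega_k = k\omega_0$, takes the real part to obtain a real waveform (an assumption), and uses a hypothetical function name and envelope callable.

```python
import numpy as np

def harmonic_synthesis(n, w0, envelope, phases):
    """Synthesize sum_k A(k*w0) * cos(n*k*w0 + phi_k) for sample indices n.

    w0 is the pitch in radians per sample, `envelope` maps frequencies to
    amplitudes, and `phases` holds one phase per harmonic."""
    phases = np.asarray(phases, dtype=float)
    k = np.arange(1, len(phases) + 1)
    w = k * w0                                   # harmonic frequencies
    terms = envelope(w)[:, None] * np.exp(1j * (np.outer(w, n) + phases[:, None]))
    return terms.sum(axis=0).real                # real part of the complex model
```

Since every component is a multiple of the pitch, the output is periodic with the pitch period, which is why only the pitch and voicing are needed to convey the sine-wave frequencies.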
CONCLUSIONS
[Figure 1 panels: (c) MEAN-SQUARED-ERROR, horizontal axis in Hz; (d) VOICING PROBABILITY, horizontal axis SNR]
References
[1] R. J. McAulay and T. F. Quatieri, Speech Analysis/Synthesis Based on
Figure 1