
IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS, VOL. AU-21, NO. 3, JUNE 1973, p. 149

Real-Time Pitch Extraction by Adaptive Prediction of the Speech Waveform


JOSEPH N. MAKSYM


Abstract-With the exception of relatively sophisticated methods such as cepstrum analysis, the problem of reliable pitch-period extraction has remained largely unsolved. This paper examines the feasibility of pitch-period extraction by means of the nonstationary error process resulting from adaptive-predictive quantization of speech. A real-time hardware system that may be realized at low cost is described.

I. Introduction
One of the most important parameters in speech analysis, synthesis, and vocoder applications is the fundamental frequency, or pitch, of voiced speech. Its determination, unfortunately, is not easy, and this problem has occupied speech researchers for many years. Of the numerous systems for pitch extraction that have been proposed, none is free from deficiencies either in performance or in excessive complexity.

Recently, a new technique based upon linear prediction of the speech waveform was proposed by Atal and Hanauer [1]. Their method consists of finding, by a least squares fitting procedure, that recursive digital filter whose impulse response approximates the speech waveform over the interval of analysis. When the recursive expression so derived is used to predict the waveform from its past sample values, the prediction error increases sharply at the onset of each vocal fold excitation pulse. Provided that the filter coefficients are periodically recomputed in accordance with the variations of the vocal tract during speech, the prediction error remains small, with the exception of pulses showing glottal excitation. These provide instantaneous pitch and voicing indication.

The method described above is very powerful, allowing a number of other parameters, such as formant frequencies and bandwidths, to be computed directly from the parameters of the resulting digital filter, but requires explicit measurement of correlations and matrix inversion. In this paper it will be shown that pitch extraction, as well as the detection of voiced speech, is obtainable from simpler systems, such as the predictive quantizers shown in Fig. 1, provided that suitable algorithms for adaptive adjustment of the predictor coefficients are used. These are developed later in the paper.

Fig. 1. Alternative structures suitable for use in pitch extraction.
II. Pitch-Period Extraction

The known techniques for determining the fundamental frequency of speech may be divided into two categories: those medically oriented methods that attempt direct measurement of the vocal fold closures, such as that described in a recent paper by Fourcin and Aberton [2]; and signal processing methods operating on the speech waveform. Some of the more successful of the latter are the following: 1) filtering to extract the fundamental; 2) nonlinear processing to accentuate peaks in the waveform; 3) pattern recognition; 4) cepstrum analysis; 5) spectrum flattening; and 6) linear prediction of the waveform.

An old, but still used, technique makes use of a fixed low-pass filter that passes the fundamental component of the waveform but suppresses all harmonics. Such a system has several obvious faults: first, the fundamental component of voiced speech is often weak or absent, as in the case of telephone speech; secondly, a fixed filter cannot at the same time satisfy the requirements of male and female speakers whose pitch may differ by as much as three octaves; and finally, the relatively long response time of such a filter implies that the resulting pitch-period indications will appear some time after the onset of voicing, and some short voiced plosives may not produce any output at all. Tracking filters solve some of these problems, but have difficulty in following rapid pitch variation, and are still plagued by the response time delay.

Manuscript received November 1, 1972. The author was with the Department of Electrical Engineering, Carleton University, Ottawa, Ont., Canada. He is now with the Defence Research Establishment Atlantic, Dartmouth, N.S., Canada.

A more instantaneous method is to extract pitch markers directly from the periodic peaks in the waveform. The detection of single high-amplitude peaks is aided by a zero-memory nonlinearity, such as a cubic function. This method fails, however, if no single distinct peak is present, and also during transitions between phonemes.

Pattern recognition that attempts to mimic the human ability to supply pitch markers on the speech waveform has been demonstrated by Gold [3]. Implementation of pattern recognition methods is complex because of the necessity for measurement and processing of a large number of features in order to achieve reliable operation. Somewhat simpler pattern recognition methods have been suggested for use on the speech spectrum by Schroeder [4] and by Harris and Weiss [5].

Pitch extraction by double-spectrum analysis was demonstrated by Noll [6], using the cepstrum technique of Bogert, Healy, and Tukey [7]. The cepstrum analysis method is readily described in terms of the simplified model of the voiced speech process, which assumes a slowly varying linear system driven by a quasi-periodic train of glottal pulses. The short-term speech spectrum is, then, composed of the closely spaced harmonics of the pulse train, denoted by $U(\omega)$, multiplied by the transfer function $G(\omega)$ of the linear system as follows:

$$X(\omega) = G(\omega) \cdot U(\omega). \tag{1}$$

By taking the logarithm of the spectral magnitudes, the multiplicative relation is converted to a sum. Thus

$$\log |X(\omega)| = \log |G(\omega)| + \log |U(\omega)|. \tag{2}$$

The cepstrum is the inverse Fourier transform of (2), and displays the periodicities in the spectrum associated with $U(\omega)$ and $G(\omega)$. The periodicity corresponding to $U(\omega)$ appears as a single isolated component in the cepstrum at a position in quefrency corresponding to the pitch period. To date, cepstrum analysis provides the most accurate and reliable source of pitch information at a cost of relatively high complexity of implementation. Cepstrum analysis, nevertheless, has a few deficiencies that are important in some applications. It indicates the pitch period as an average value for the segment of speech waveform analyzed (typically several pitch periods) and not the epoch of the glottal pulses. Furthermore, if the analyzed segment is shortened to a pitch period or less, the pitch indication becomes erratic.

A number of more recent methods for pitch extraction, including the spectrum flattening technique of Sondhi [8] and linear prediction of the waveform by Atal and Hanauer [1], involve recovery of the pulse train $u(t)$ from the speech waveform $x(t)$. Both methods indicate the epoch of occurrence of the glottal excitation. Sondhi's technique is readily understood by rewriting $G(\omega)$ in (1) as magnitude and phase functions so that

$$X(\omega) = |G(\omega)| \, e^{-j\theta(\omega)} \cdot U(\omega). \tag{3}$$

This suggests that $u(t)$ might be recovered by a spectral decomposition of $x(t)$, followed by scaling and phase shifting of the spectral components and summation of the results. The method is difficult to implement since both $|G(\omega)|$ and $\theta(\omega)$ must be adaptively estimated. Estimation of $\theta(\omega)$ may be avoided at the cost of using autocorrelation analysis on the spectrum-flattened signal, however.

Pitch Extraction by Linear Prediction

The model for voiced speech implied by (1) may be expressed in terms of a recursive digital filter excited once per pitch period by an impulse. Following a terminology similar to that of [1],

$$X(z) = \frac{1}{1 - A(z)} \cdot U(z). \tag{4}$$

If a linear predictor is used for prediction of the next signal sample $x_n$ according to

$$\hat{x}_n = \sum_{i=1}^{m} b_i x_{n-i} \tag{5}$$

the transform of the error may be written as

$$E(z) = X(z) - B(z) X(z) \tag{6}$$

where $B(z)$ is the transform of the predictor. After substitution for $X(z)$,

$$E(z) = \frac{1 - B(z)}{1 - A(z)} \cdot U(z). \tag{7}$$

If $B(z) \approx A(z)$, the error signal $e(t)$, recovered by low-pass filtering of the error sequence $\{e_n\}$, approximates the exciting pulse train $u(t)$, and may be used to extract pitch information. There is an obvious similarity to the spectrum-flattening and phase-shifting methods, but estimation of $|G(\omega)|$ and $\theta(\omega)$ is now replaced by predictor adaptation, which is capable of simple implementation.
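As a rough illustration of (4)-(7), the sketch below excites an assumed two-pole recursive filter with a periodic impulse train and then applies a predictor whose coefficients match the filter, so that $B(z) = A(z)$. The coefficient values, the pitch period of 80 samples, and the function names `synthesize` and `prediction_error` are illustrative assumptions and do not come from the paper; the low-pass filtering of the error sequence is omitted.

```python
def synthesize(a, u):
    """All-pole model of (4): x[n] = sum_i a[i] * x[n-1-i] + u[n]."""
    x = []
    for n in range(len(u)):
        past = sum(a[i] * x[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        x.append(past + u[n])
    return x


def prediction_error(b, x):
    """Error of (5)-(6): e[n] = x[n] - sum_i b[i] * x[n-1-i]."""
    e = []
    for n in range(len(x)):
        pred = sum(b[i] * x[n - 1 - i] for i in range(len(b)) if n - 1 - i >= 0)
        e.append(x[n] - pred)
    return e


PERIOD = 80                       # samples per pitch period (assumed value)
u = [1.0 if n % PERIOD == 0 else 0.0 for n in range(400)]
a = [1.3, -0.8]                   # assumed stable two-pole resonance
x = synthesize(a, u)

e_matched = prediction_error(a, x)            # predictor equal to the model
pulses = [n for n, v in enumerate(e_matched) if abs(v) > 0.5]
print(pulses)                                 # [0, 80, 160, 240, 320]
```

With a matched predictor the error is zero between excitations and reproduces the impulse train at the excitation instants, which is the behavior (7) predicts for $B(z) = A(z)$. A mismatched predictor would leave a residual ringing between the pulses.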
III. Differential Systems for Pitch Extraction

Examination of the linear prediction method of pitch extraction reveals that two conditions must be satisfied: low prediction error between glottal excitations and during unvoiced speech segments, and high prediction error at the onset of glottal excitation. These conditions can be met by the differential quantizers shown in Fig. 1, provided that the coefficients $a_i$ (not generally the same for the three different configurations) are suitably adjusted to follow the syllabic variation of the speech source. For ease of reference, the three configurations of Fig. 1 will hereafter be referred to as systems (a), (b), and (c) in conformance with the labeling on the figure.

Differential quantizers have the well-known ability to encode with low quantization error signal sequences that exhibit high correlation between samples. Such is the case for speech during voiced segments between glottal excitations, and for unvoiced segments provided that the sampling rate is sufficiently high. In this paper, sampling rates in the range 40-60 kHz are considered. It should be noted that these high sampling rates are in direct contrast to the 10 kHz or lower rates used by Atal and Hanauer, and allow low prediction error to be realized with relatively crude predictors and predictor adaptation algorithms.

Referring to system (a) of Fig. 1, and denoting the quantization error at the $n$th sampling instant by $q_n$, we have

$$\tilde{x}_n = y_n + \hat{x}_n = x_n + q_n. \tag{8}$$

The predictor output sample may then be expressed as

$$\hat{x}_n = \sum_{i=1}^{m} a_i (x_{n-i} + q_{n-i}). \tag{9}$$

This is identical to (5) for small quantization error, as would occur if the quantization were sufficiently fine. The predictor in system (a) is complicated by the necessity to operate upon the samples $\{\tilde{x}_{n-i} : i = 1, 2, \ldots, m\}$, since these must be stored as many-bit digital numbers. This problem is avoided by system (b), which in simplest form uses a binary quantizer and a binary shift register to store the samples $\{y_{n-i} : i = 1, 2, \ldots, m\}$. Assuming identical signal sequences, which we can write in transform notation as $X(z)$, and constraining the resulting prediction sequences $\hat{X}(z)$ to be identical, we obtain

$$A_b(z) = \frac{A_a(z)}{1 - A_a(z)} \tag{10}$$

where $A_a(z)$, $A_b(z)$ are the transfer functions of the digital filter blocks for systems (a) and (b), respectively. It should be noted that the number of coefficients in system (b) is not necessarily finite, but that in practice, some finite number suffices to give predictor performance that is only slightly worse than that for system (a). For input signals whose samples are highly correlated, system (b) requires a large number of coefficients to achieve a low prediction error. System (c) avoids this problem by including an integrator as part of the predictor. Increasing the sampling rate in system (c) has, therefore, the effect of reducing prediction error even if the number of coefficients is small.

Iterative Adjustment of the Predictor

Adaptive adjustment algorithms for the predictor coefficients in systems (a), (b), and (c) of Fig. 1 can be derived by consideration of mean-squared prediction error. For system (a) this may be written

$$\mathrm{mse} = E\{(x_n - A^T X_n)^2\} \tag{11}$$

where $A = (a_1, \ldots, a_m)^T$ and $X_n = (\tilde{x}_{n-1}, \ldots, \tilde{x}_{n-m})^T$. Since (11) is a downward convex quadratic hypersurface in the coefficient values, its minimum is attained when the gradient is zero. That is,

$$\nabla\, \mathrm{mse} = -2E\{(x_n - A^T X_n) X_n\} = 0. \tag{12}$$

Alternatively, recognizing that the term in the inner parentheses is just $e_n$,

$$\nabla\, \mathrm{mse} = -2E\{e_n X_n\} = 0. \tag{13}$$

Theoretically, one could obtain the optimum coefficient vector by measurement of the expected values in (12) and solution of the resulting matrix equation. A more readily implemented recursive algorithm may be derived, however, by noting that (13) is a regression function whose root is the optimum coefficient vector. Selection of a small positive constant $v$ as a step-size parameter yields the following recursive algorithm for the coefficient vector:

$$A(n+1) = A(n) + v\, e_n X_n. \tag{14}$$

A number of modified algorithms in which $e_n$ is replaced by the quantized value $y_n$, or in which the instantaneous gradient term $e_n X_n$ is replaced by its sign, are also possible and lead to essentially the same result for the coefficient vector. This is the case since the regression functions $E\{y_n X_n\}$ and $E\{\mathrm{sgn}(e_n X_n)\}$ have essentially the same root, as can be shown by simulating the system and measuring these expectations [9]. An identical development for system (b) of Fig. 1 yields the following modified algorithm:

$$A(n+1) = A(n) + v\, y_n Y_n, \tag{15}$$

while for system (c),

$$A(n+1) = A(n) + v\, y_n \sum_{i=0}^{n} Y_{n-i}. \tag{16}$$
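A minimal sketch of recursion (14) and of the variant in which the gradient term $e_n X_n$ is replaced by its sign is given below. It is a simplified illustration, not the quantized systems of Fig. 1: the input is an assumed synthetic second-order autoregressive sequence driven by Gaussian noise, the samples are unquantized (the fine-quantization limit of system (a)), and the step sizes, the seed, and the function names `ar_signal` and `adapt` are hypothetical choices.

```python
import random


def ar_signal(a, n_samples, seed=0):
    """Synthetic test input: x[n] = sum_i a[i] * x[n-1-i] + white Gaussian noise."""
    rng = random.Random(seed)
    x = [0.0] * len(a)                         # zero initial history
    for _ in range(n_samples):
        pred = sum(a[i] * x[-1 - i] for i in range(len(a)))
        x.append(pred + rng.gauss(0.0, 1.0))
    return x[len(a):]


def adapt(x, m, v, use_sign=False):
    """Iterate A(n+1) = A(n) + v * e_n * X_n, or its sign variant, over x."""
    a = [0.0] * m
    for n in range(m, len(x)):
        past = [x[n - 1 - i] for i in range(m)]      # X_n = (x[n-1], ..., x[n-m])
        e = x[n] - sum(a[i] * past[i] for i in range(m))
        for i in range(m):
            g = e * past[i]                          # instantaneous gradient term
            if use_sign:
                g = (g > 0) - (g < 0)                # keep only its sign
            a[i] += v * g
    return a


true_a = [1.3, -0.8]                                 # assumed source coefficients
x = ar_signal(true_a, 40000)
a_lms = adapt(x, m=2, v=0.002)                       # recursion (14)
a_sign = adapt(x, m=2, v=0.001, use_sign=True)       # sign-of-gradient variant
print(a_lms, a_sign)
```

On this stationary test input both recursions should settle near the coefficients used to generate the sequence, which is consistent with the statement that the two regression functions have essentially the same root. The sign variant needs only an increment or decrement of fixed size per step, which is what makes it attractive for simple hardware.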

The increment in the coefficient vector as given by (16) for system (c) is a function of all the past quantized error samples. For a nonstationary signal source model, this is undesirable, and in fact, it is found that

dependence only upon the most recent vector $Y_n$ yields the lowest prediction error [9]. Accordingly, algorithm (15) will be used in the experimental pitch extractor to be described next.

IV. Hardware Implementation

An embodiment of system (c) suitable for real-time pitch extraction is shown in Fig. 2. Eight predictor coefficients, each as an eight-bit binary number, are stored in a recirculating shift-register memory that is clocked at a nominal 2-MHz rate. Thus, each coefficient is presented at the memory output in turn after each successive clock pulse. A similar shift register stores the eight most recent binary quantized prediction error samples. This allows coefficient adaptation and formation of the integrator input in 16 pulses of the 2-MHz clock. A single 12-bit adder is used for both functions: coefficient incrementation by the addition or subtraction of one least significant bit as required by algorithm (15) during the first eight clock pulses, and formation of the integrator input by summation of coefficients with signs determined by the quantized error samples in the data store during the next eight clock pulses. The 2-MHz clock is then shut off in readiness for the next pulse from the basic clock, which operates at a 40-kHz rate. Integration takes place over an interval approximately 15 μs in length prior to the next basic clock pulse, at which time a comparator and bistable form the new binary sample $y_n$.

Fig. 2. Block structure of experimental system. Double lines indicate vector-valued quantities. [Block labels: low-pass filter (3.1 kHz), pitch extractor, basic clock, data storage, integrator, coefficient storage, digital-to-analog converter, arithmetic unit.]

The prediction error $x(t) - \hat{x}(t)$ is low-pass filtered by an eight-pole 3.1-kHz cutoff Butterworth active filter whose output is used to extract pitch information. A similar filter at the speech waveform input selects that part of the energy which is significant to pitch extraction, while suppressing much of the energy in unvoiced speech that is known to be concentrated at higher frequencies. This filter, therefore, aids in keeping the prediction error small during unvoiced segments of speech.

The digital implementation of Fig. 2 is by no means the simplest or least expensive. It is possible, by storing coefficient values as analog voltages on capacitors or integrators, to eliminate the 2-MHz clock and associated control circuitry. Because of the various feedback and adaptive loops, the performance is not overly sensitive to the usual tolerances of analog components, and a substantial decrease in overall cost and complexity may be achieved.

V. Performance

The waveforms obtained from the experimental system for typical voiced speech inputs are shown in Figs. 3 and 4. Fig. 3 shows the waveform at various points in the system for the phoneme /i/, as in the word deed. The time scale in the figure is 2 ms/div, while the amplitude scale is 5 V/div. Fig. 3 demonstrates the bursts of error that occur in the system during glottal excitation, and which, after further simple processing, may be used to extract pitch information. It was found in tests with a wide selection of voiced phonemes as input that the duration of the prediction error burst is longest for the phoneme /i/. However, even in this case, the epoch of glottal excitation may be determined without ambiguity if the envelope of the error is used to trigger, for example, a threshold detector which then produces a standard pulse. It is interesting to note that the speech waveform in Fig. 3 contains a large amplitude sinusoidal component at approximately twice the fundamental frequency, but that only one error burst is produced by the system. The correct pitch is therefore determined, whereas methods that use low-pass filters or nonlinear distortion of the speech waveform would have a tendency to indicate twice the actual pitch.

Fig. 3. Waveforms in the experimental pitch extractor for the phoneme /i/ in beet. (a) $x(t)$, $-\hat{x}(t)$, integrator input, $e(t)$ filtered. (b) $x(t)$, $e(t)$, $e(t)$ filtered, $\{y_n\}$.

Examples of the application of a square-law nonlinearity to the error waveform are shown in Fig. 4. The effect is to accentuate peaks in the waveform and invert negative pulses while suppressing much of the low-amplitude noise. The four pictures shown in Fig. 4 were obtained with the voiced phonemes /ae/, /i/, /e/, and /ɔ/ spoken in context by a male voice. The time scale is 2 ms/div, and the amplitude scales in all four parts of the figure are: 5 V/div for the speech waveform, 10 V/div for the error, 2 V/div for the output of the error squaring circuit, and 5 V/div for the pulse output. Although the pictures in Fig. 4 were obtained with the squared-error pulses driving a Schmitt trigger directly, it probably would be safer to first obtain the envelope of the squared-error waveform, as this would avoid the danger of ambiguity for the type of error waveform shown in Fig. 3.

Fig. 4. Speech waveform, error waveform, squared-error waveform, and pulse output of experimental pitch extraction system. (a) The phoneme /ae/ in hat. (b) The phoneme /i/ in beet. (c) The phoneme /e/ in bet. (d) The phoneme /ɔ/ in bought.

A number of tests were conducted to determine if false pitch output pulses would be produced for unvoiced speech. These were not observed, either for unvoiced segments in the context of normal speech, or when the system was deliberately excited with high-amplitude unvoiced phonemes such as /s/ in see or /ʃ/ in she. These tests, although they admittedly do not allow an objective comparison between the experimental system and other forms of pitch extraction, do indicate that the experimental system shows promise as a voicing detector at the same time as it extracts pitch.
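The post-processing discussed in this section (square-law nonlinearity, optional envelope, and a trigger with hysteresis) can be sketched in software as follows. This is a hedged illustration only: the error waveform is synthetic, and the burst shape, the thresholds, the envelope decay factor, and the function names are assumptions rather than values from the experimental system. Only the 40-kHz sampling rate is taken from the text.

```python
def square_law(e):
    """Square-law nonlinearity: accentuates peaks and inverts negative pulses."""
    return [v * v for v in e]


def envelope(s, decay=0.9):
    """Peak-hold envelope: follows rises at once, then decays geometrically."""
    env, level = [], 0.0
    for v in s:
        level = max(v, level * decay)
        env.append(level)
    return env


def trigger_onsets(s, high=0.5, low=0.1):
    """Schmitt-style trigger: fire on rising above `high`, re-arm below `low`."""
    onsets, armed = [], True
    for n, v in enumerate(s):
        if armed and v >= high:
            onsets.append(n)
            armed = False
        elif not armed and v < low:
            armed = True
    return onsets


FS = 40_000                                   # basic clock rate from the text, Hz
burst = [1.0, 0.0, 0.0, -0.9, 0.4]            # assumed two-peaked error burst
error = [0.05 * (-1) ** n for n in range(600)]    # low-level background error
for start in (50, 250, 450):                  # one burst per assumed pitch period
    error[start:start + len(burst)] = burst

squared = square_law(error)
direct = trigger_onsets(squared)              # trigger driven directly
via_envelope = trigger_onsets(envelope(squared))
periods = [b - a for a, b in zip(via_envelope, via_envelope[1:])]
f0 = [FS / p for p in periods]
print(direct, via_envelope, f0)
```

For this synthetic burst the directly driven trigger fires twice per excitation (at samples 50 and 53, and so on), while the envelope-driven trigger fires once, giving a period of 200 samples, or 200 Hz at a 40-kHz rate. This mirrors the remark that taking the envelope first avoids ambiguity when a burst contains more than one peak.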


VI. Conclusions

A new technique for pitch extraction and voicing indication has been described. It operates by performing short-term prediction of the speech waveform, and using the resultant prediction error to detect the presence of glottal excitation. It was determined that the proposed method has several useful features. Among them are: ease of implementation, ability to respond quickly to glottal excitations at the beginning of words, insensitivity to unvoiced speech sounds, indication of the epoch of glottal excitation and not simply the period, ability to follow rapid pitch changes, and the ability to operate on waveforms where the fundamental is weak or absent. These features, many of which are not present in the pitch-extraction techniques listed in Section II, would recommend the proposed system for such applications as pitch and voicing input for vocoders, pitch extraction for speech analysis and processing, and speech aids for the deaf. Further research including extensive performance comparisons between the proposed system and other methods is recommended, however, before an objective evaluation of the new system can be made.

Acknowledgment

The author wishes to thank Dr. D. A. George for suggesting predictive encoding as a promising area for research. He also wishes to thank Dr. L. R. Morris for useful discussions on speech processing.

References

[1] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Amer., vol. 50, no. 2, part 2, 1971.
[2] A. J. Fourcin and E. Aberton, "First applications of a new laryngograph," Med. Biol. Illus., vol. 21, July 1971.
[3] B. Gold, "Computer program for pitch extraction," J. Acoust. Soc. Amer., vol. 34, July 1962.
[4] M. R. Schroeder, "Period histogram and product spectrum: New methods for fundamental frequency measurement," J. Acoust. Soc. Amer., vol. 43, no. 4, 1968.
[5] C. M. Harris and M. R. Weiss, "Pitch extraction by computer processing of high resolution Fourier analysis data," J. Acoust. Soc. Amer., vol. 35, Mar. 1963.
[6] A. M. Noll, "Short-time spectrum and cepstrum techniques for vocal-pitch detection," J. Acoust. Soc. Amer., vol. 36, Feb. 1964.
[7] B. P. Bogert, M. J. R. Healy, and J. W. Tukey, Time Series Analysis. New York: Wiley, 1963, ch. 15.
[8] M. M. Sondhi, "New methods of pitch extraction," IEEE Trans. Audio Electroacoust., vol. AU-16, pp. 262-266, June 1968.
[9] J. N. Maksym, "Iterative adjustment of predictive quantizers," Ph.D. dissertation, Dep. Elec. Eng., Carleton Univ., Ottawa, Ont., Canada, 1972.


Application of a Digital Inverse Filter for Automatic Formant and $F_0$ Analysis


JOHN D. MARKEL

Abstract-In this paper, a new algorithm based upon a digital inverse filter formulation is presented for automatically determining VU, a voiced-unvoiced decision (VU = 0 during unvoiced speech and VU = 1 during voiced speech), $F_0$, the fundamental frequency, and $F_i$, $i = 1, 2, 3$, the first three formant frequencies, as a function of time. Formant trajectory estimates are obtained for all speech sounds that satisfy VU = 1.

Manuscript received April 30, 1972. This work was supported by the Office of Naval Research under Contract N00014-67-C-0118 with the Speech Communications Laboratory, Santa Barbara, Calif. 93101. The author is with the Speech Communications Laboratory (SCRL), Santa Barbara, Calif. 93101.

Introduction

The purpose of this paper is to present a new algorithm for automatically extracting the first three formant frequencies for voiced male speech and the fundamental frequency. Explicit in the fundamental frequency extraction is VU, a voiced-unvoiced decision. The central element in the analysis is the digital inverse filter. Based upon the first $M + 1$ terms of the input autocorrelation sequence, coefficients of an $M$th degree, all-zero digital filter are calculated. The formant trajectory estimates for each frame are based solely upon the locations of the local minima of the corresponding spectrum of the resultant inverse filter. The VU decision is determined by the amplitude of the largest peak of the normalized autocorrelation sequence of the output of the inverse filter (excluding the origin). If VU = 1, then $F_0$ is defined as the reciprocal of the peak location.
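The VU and $F_0$ rule described above can be sketched as follows, under several assumptions that are not stated in the text: the decision threshold of 0.4, the lag search range, the 10-kHz sampling rate, the synthetic inverse-filter outputs, and the function names are all hypothetical.

```python
import random


def normalized_autocorrelation(e, max_lag):
    """r[k] = sum_n e[n] * e[n+k] / sum_n e[n]^2, for k = 0 .. max_lag."""
    r0 = sum(v * v for v in e)
    return [sum(e[n] * e[n + k] for n in range(len(e) - k)) / r0
            for k in range(max_lag + 1)]


def vu_and_f0(e, fs, min_lag=20, max_lag=200, threshold=0.4):
    """Return (VU, F0): VU = 1 if the largest peak away from the origin is
    at least `threshold`, with F0 the reciprocal of the peak location."""
    r = normalized_autocorrelation(e, max_lag)
    lag = max(range(min_lag, max_lag + 1), key=lambda k: r[k])
    if r[lag] < threshold:
        return 0, None
    return 1, fs / lag


FS = 10_000                                        # assumed sampling rate, Hz
residual_voiced = [1.0 if n % 80 == 0 else 0.0 for n in range(400)]
rng = random.Random(1)
residual_unvoiced = [rng.gauss(0.0, 1.0) for _ in range(400)]

voiced = vu_and_f0(residual_voiced, FS)
unvoiced = vu_and_f0(residual_unvoiced, FS)
print(voiced, unvoiced)
```

For the pulse-like residual the largest peak lies at a lag of 80 samples with a normalized value of 0.8, so the sketch reports VU = 1 and $F_0$ = 125 Hz. The noise-like residual has no comparable peak and is classed as unvoiced.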
Brief Review of Digital Inverse Filter Formulation

The following formulation has been proposed for extracting the resonance behavior from a sequence of preemphasized speech data $\{x_n\}$ [1]. Given a digital inverse filter
