Judith C. Brown
Physics Department, Wellesley College, Wellesley, Massachusetts 02181 and Media Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
Miller S. Puckette
IRCAM, 31 rue St Merri, Paris 75004, France
transform (FFT) bin corresponding to the fundamental frequency is chosen by the frequency tracker, an approximation is used for the phase change in the FFT for a time advance of one sample to obtain an extremely precise value for this frequency. Graphical examples are given for musical passages by a violin executing vibrato and glissando where the fundamental frequency changes are rapid and continuous.
PACS numbers: 43.75.Yy, 43.60.Gk
INTRODUCTION
time resolution
Our fundamental frequency tracker is based upon the calculation of a constant Q spectral transform described recently by Brown and Puckette (1992). There we showed that the direct calculation of the components of a constant Q transform can be accomplished more efficiently by a transformation of the fast Fourier transform (FFT). To summarize that calculation, the direct calculation can be carried out using

$$X^{cq}[k_{cq}] = \sum_{n=0}^{N[k_{cq}]-1} w[n,k_{cq}]\, x[n]\, e^{-j\omega_{k_{cq}} n},$$

where $X^{cq}[k_{cq}]$ is the $k_{cq}$ component of the constant Q transform, $w[n,k_{cq}]$ is a window function of length $N[k_{cq}]$, $x[n]$ is a sampled function of time, and $\omega_{k_{cq}}$ is the frequency of this component.

This can be evaluated using the following form of Parseval's equation

$$\sum_{n=0}^{N-1} x[n]\, \mathcal{K}^{*}[n,k_{cq}] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, K^{*}[k,k_{cq}]$$

if we define

$$w[n,k_{cq}]\, e^{-j\omega_{k_{cq}} n} = \mathcal{K}^{*}[n,k_{cq}],$$

with $K[k,k_{cq}]$ the discrete Fourier transform of $\mathcal{K}[n,k_{cq}]$. Since the functions $K^{*}[k,k_{cq}]$ have very few nonzero components, the fast Fourier transform $X[k]$ of the signal $x[n]$ can be transformed into a constant Q transform with very few additional operations.

For application to the performance of modern computer music, which includes extremely rapid passages, a [...] able trade-off between temporal and frequency resolution. Our compromise consists of limiting the temporal extent of the window. This means that the low-frequency bins of our transform have constant frequency resolution (equal to the sample rate over the temporal window length) rather than constant Q. For example, we may choose for the center frequencies of the bins of the transform to correspond to the frequencies of notes of the equal tempered musical scale beginning with the first bin at C3 (130.9 Hz). Then a window length of 25 ms means that the resolution is a constant equal to 40 Hz (while the Q is variable) up to a frequency of 717.8 Hz or the 30th bin. The resolution is then variable, with Q constant and roughly equal to 17, up to a frequency of 5274 Hz or the 65th bin. We will call this transform the modified constant Q transform, and we will show that limiting the temporal extent of the window for the low-frequency bins does not lead to decreased performance for the detection of the fundamental frequency. It does, of course, mean greater spillover for the low-frequency bins.

An example of this calculation can be found in Fig. 1 for a clarinet playing a chromatic scale where we have plotted the amplitude of the modified constant Q transform components against bin number in each frame. Time is increasing vertically for these frames. Here the lower Q's (with the lowest value about 5) of the low-frequency bins are manifested by a greater bandwidth. This figure can be compared to Fig. 4 of Brown and Puckette (1992) for a spectrum of the same sound calculated with a "true" con-
0001-4966/93/94(2)/662/6/$6.00
[Figure 1 panel title: CLARINET SCALE. Figure 2 panel title: VIOLIN SCALE.]
FIG. 1. Amplitude of the modified constant Q spectral components plotted against bin number (corresponding to frequency) for a series of time frames. A clarinet with microphone placed in the barrel is playing several octaves of a chromatic scale.
FIG. 2. Frequency tracking results for a violin playing two octaves of a diatonic scale. The open square is the result from the modified constant Q fundamental frequency tracker. The solid triangle is the precise frequency returned by the high-resolution phase calculation.
stant Q of 17. It is interesting to note that the even harmonics are missing, as predicted for a clarinet, up to about the 29th bin, and at this point the energy in the third harmonic begins to decrease. The clarinet sound was recorded with the microphone in the barrel of the instrument, so the spectrum is not as rich as for a normal clarinet sound.

663 J. Acoust. Soc. Am., Vol. 94, No. 2, Pt. 1, August 1993

I. CALCULATION OF INITIAL FREQUENCY ESTIMATE

When the constant Q transform of a sound consisting of harmonic frequency components is plotted against log frequency, the spacing of these components is invariant (Brown, 1991). The fundamental frequency can then be determined by finding the position in log frequency space of this invariant pattern. This is best accomplished by calculating the cross-correlation function of each frame of the log frequency spectrum with the ideal pattern as discussed by Brown (1992).

The "ideal pattern" used in calculating the cross-correlation function consists of components with the frequency spacing discussed above and with amplitudes decreasing linearly from 1 for the fundamental to 0.6 for the highest harmonic. The purpose of varying the amplitudes is to prevent the choice of the frequency an octave below that of the true fundamental. For this position of the ideal pattern, all even components of the ideal pattern line up with components present in the spectrum. This is because the spacing of components 2f, 4f, 6f, etc. is the same as that of components f, 2f, 3f, etc. If all components are weighted equally, the cross-correlation function will have the same value for even components of the pattern aligned with the components of the signal as for the "true" position where all components of the pattern fall on their counterparts in the test spectrum. However, with smaller weights on higher components, this error is avoided.

Results of our calculation applied to digitized violin and clarinet scales are found in Figs. 2 and 3 where the open squares correspond to the frequencies of the notes of the equal tempered scale chosen by our frequency tracker plotted against frame number. The solid triangles will be explained in the next section. Each frame corresponds to a time advance of 25 ms in the sound. These are examples of instruments with very different spectra. The violin has a complex spectrum with many higher harmonics present. This clarinet sound was recorded with the microphone in the barrel and has a relatively simple spectrum with a smaller number of harmonics. The number of components in the ideal pattern used for the cross-correlation varied accordingly, with 10 components for the violin and three components for the clarinet. There were essentially no errors in determining fundamental frequencies of the notes present with our modified constant Q transform.

We have estimated the run time for our algorithm with calculations carried out on a 40-MHz Intel i860 using a hand-coded routine. With a 512-point FFT and quarter-tone spacing over three octaves, the FFT takes 343 μs and the transform 166 ± 2 μs (measured on an oscilloscope). The cross-correlation calculation involves well under 250 multiplies even with ten components in the harmonic pattern, so the computation time for this operation is negligible. The overall computation time then is on the order of 0.5 ms. This can be compared to a time advance of 25 ms between frames, so the calculation is easily carried out in real time.
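The pattern-matching step of Sec. I can be illustrated with a minimal sketch. It is not the authors' implementation: the semitone bin spacing, the synthetic five-harmonic spectrum, and the names `harmonic_offsets` and `track_fundamental` are all assumptions made for illustration. It shows why the tapered weights (1 down to 0.6) matter: with equal weights, a ten-component pattern placed an octave below a five-harmonic note scores exactly as well as the true position.

```python
import numpy as np

BINS_PER_OCTAVE = 12   # semitone spacing, assumed for this sketch
N_BINS = 72            # six octaves of log-frequency bins (hypothetical)

def harmonic_offsets(n_harmonics):
    """Bin offsets of harmonics 1..n on a log-frequency axis."""
    h = np.arange(1, n_harmonics + 1)
    return np.rint(BINS_PER_OCTAVE * np.log2(h)).astype(int)

def track_fundamental(spectrum, weights):
    """Return the bin at which the weighted harmonic pattern correlates best."""
    offsets = harmonic_offsets(len(weights))
    scores = np.zeros(len(spectrum))
    for start in range(len(spectrum)):
        pos = start + offsets
        valid = pos < len(spectrum)        # drop pattern parts above the top bin
        scores[start] = np.sum(weights[valid] * spectrum[pos[valid]])
    return int(np.argmax(scores))

# Synthetic frame: five equal-amplitude harmonics, fundamental in bin 24.
true_bin = 24
spectrum = np.zeros(N_BINS)
spectrum[true_bin + harmonic_offsets(5)] = 1.0

flat = np.ones(10)                       # equal weights: octave-below position ties
tapered = np.linspace(1.0, 0.6, 10)      # linear taper described in the text

print(track_fundamental(spectrum, flat))     # 12, one octave too low
print(track_fundamental(spectrum, tapered))  # 24, the true fundamental bin
```

With equal weights the tie is broken toward the lower bin only because `np.argmax` returns the first maximum; the tapered pattern removes the tie altogether.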
II. HIGH-RESOLUTION FREQUENCY DETERMINATION

[...] works extremely well for instruments playing discrete notes belonging to the equal tempered scale. Here the smallest frequency difference between notes is approximately 6%, and the results are reported as notes of the equal tempered scale. However, a very different situation can arise in passages played by stringed instruments or wind instruments. These instruments are not constrained to play discrete notes as are keyboard instruments. Thus the frequency can vary continuously as in, for example, a glissando or vibrato. Keyboard instruments may also be [...]

[Figure 3 panel title: CLARINET SCALE; horizontal axis: time (s)]
FIG. 3. Frequency tracking results for a clarinet playing several octaves of a chromatic scale. The open square is the result from the modified constant Q fundamental frequency tracker. The solid triangle is the precise frequency returned by the high-resolution phase calculation.

A. Phase background

[...] Oppenheim and Lim (1981). They point out that in a variety of disciplines phase-only information leads to a more recognizable reconstruction of the original object analyzed than does information based on magnitude only. Friedman (1985) demonstrated that one can obtain narrower formant bands for sonagrams of speech using a histogram of occurrences against frequency with the frequency obtained from the time derivative of the phase of the short-time Fourier transform. Computationally this was obtained by calculating two short-time Fourier transforms (STFT's) with the derivative of the window function used in one of them. Earlier, in work on the phase vocoder applied to speech signals, Flanagan and Golden (1966) used phase differences to obtain greater accuracy of Fourier components. Beauchamp (1966, 1969) and Grey and Moorer (1977) used a similar technique applied to musical signals. See also Moorer (1978) and Dolson (1986).

Charpentier (1986) described a pitch tracker based on frequencies obtained from an approximation for the phase difference of time frames of the STFT separated by one sample. We obtained this same expression independently and will discuss it in the following section.

B. Phase calculation

The frequency of a particular Fourier component as obtained from the bin into which it falls in the magnitude spectrum is only as accurate as the frequency difference between bins, in our case 6% or 3% depending on the calculation. This estimate can be improved by a quadratic fit using the amplitudes of the bin containing the maximum and the two adjacent bins and identifying the position of the maximum of the parabola thus obtained. This is an extremely well-known technique described recently by Smith and Serra (1987). We will discuss the accuracy of this approximation in a later section.

Even more accurate is a method we have developed based on an approximation for the phase change per unit sample for the Fourier component chosen as the correct fundamental frequency by our frequency tracker. It is well known that the frequency as determined by the change in phase is much more accurate than that obtained from the magnitude spectrum. However, the problem with determining the frequency from the phase difference over a reasonable hop size (samples between frames) is one of phase unwrapping. The phase change is only known modulo 2π. This problem does not arise with a hop size of one sample since the highest digital frequency is π radians/sample, but this case necessitates the computation of an additional FFT.

In fact this additional computation can be avoided by using an approximation which assumes that x[n] is periodic. The phase change for a hop size of one sample can be obtained from the following identity (Oppenheim and Schafer, 1975; Charpentier, 1986). If $\mathcal{F}\{x[n]\} = X[k]$ is the kth component of the discrete Fourier transform of x[n], then

$$\mathcal{F}\{x[n+m]\} = e^{j2\pi km/N}\, X[k] \qquad (1)$$

is the DFT after m samples.

The above equation applies to an unwindowed DFT. It is possible to use this result to obtain a Hanning-windowed transform since the effect of windowing can be calculated in the frequency domain for this window. We will use the notation $X^{H}[k,n_0]$ to denote the Hanning-windowed Fourier transform evaluated for a window beginning on sample $n_0$, that is

$$X^{H}[k,n_0] = \sum_{n=0}^{N-1} x[n+n_0]\, w[n]\, e^{-j2\pi kn/N},$$

where

$$w[n] = \tfrac{1}{2} - \tfrac{1}{2}\cos(2\pi n/N). \qquad (2)$$

Using Eq. (1) with m = 1 in Eq. (2), the approximation for the Hanning-windowed DFT after one sample is
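$$X^{H}[k,n_0+1] \approx \tfrac{1}{2}\left\{ e^{j2\pi k/N} X[k,n_0] - \tfrac{1}{2} e^{j2\pi (k+1)/N} X[k+1,n_0] - \tfrac{1}{2} e^{j2\pi (k-1)/N} X[k-1,n_0] \right\}, \qquad (3)$$

where $X[k,n_0]$ is the unwindowed DFT of the N samples beginning on sample $n_0$. The digital frequency in radians per sample is then the phase difference

$$\omega(k,n_0) = \phi(k,n_0+1) - \phi(k,n_0), \qquad (4)$$

where

$$\phi(k,n_0+1) = \arctan\left[\frac{\mathrm{Im}(X^{H}[k,n_0+1])}{\mathrm{Re}(X^{H}[k,n_0+1])}\right]$$

and

$$\phi(k,n_0) = \arctan\left[\frac{\mathrm{Im}(X^{H}[k,n_0])}{\mathrm{Re}(X^{H}[k,n_0])}\right].$$

[Figure 4 panel title: TEST FILE WITH THREE SINUSOIDS; horizontal axis: FFT bin number]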
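The phase calculation can be sketched in a few lines, assuming the frequency-domain form of the Hanning window that follows from w[n] = 1/2 - 1/2 cos(2πn/N), namely a weighted sum of bins k-1, k, and k+1 of a single unwindowed FFT. The function name `phase_frequency` is hypothetical, and `np.angle` of the product with the complex conjugate stands in for the difference of the two arctangents, wrapped to (-π, π]. The test tone at bin position 10.1 for a 256-point DFT at 11 025 Hz is only an illustrative choice.

```python
import numpy as np

def phase_frequency(frame, k, sample_rate):
    """Estimate the frequency (Hz) seen in DFT bin k from one unwindowed FFT."""
    N = len(frame)
    X = np.fft.fft(frame)                     # the only FFT that is computed
    idx = np.array([k - 1, k, k + 1]) % N     # the three bins that are needed
    Xm, X0, Xp = X[idx]
    # Hanning-windowed bin k, formed in the frequency domain.
    XH_now = 0.5 * (X0 - 0.5 * Xp - 0.5 * Xm)
    # Same bin one sample later: each term rotated by exp(j*2*pi*bin/N).
    rot = np.exp(2j * np.pi * idx / N)
    XH_next = 0.5 * (rot[1] * X0 - 0.5 * rot[2] * Xp - 0.5 * rot[0] * Xm)
    # Phase advance per sample, then conversion from radians/sample to Hz.
    omega = np.angle(XH_next * np.conj(XH_now))
    return omega * sample_rate / (2 * np.pi)

sample_rate = 11025
N = 256
f_true = 434.97                               # bin position 10.1
n = np.arange(N)
frame = np.cos(2 * np.pi * f_true * n / sample_rate)
estimate = phase_frequency(frame, 10, sample_rate)
print(f"bin center {10 * sample_rate / N:.2f} Hz, phase estimate {estimate:.2f} Hz")
```

Only three complex bins enter the estimate, which is why the extra cost beyond the FFT already computed for the tracker is small.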
To check this method a test file was generated in software consisting of the superposition of sinusoidal components of equal amplitude at frequencies 434.97, 1739.88, and 3479.77 Hz. These frequencies were chosen to fall into bin positions 10.1, 40.4, and 80.8. A 256-point DFT was taken, and the real and imaginary components obtained were substituted into Eq. (4) using the definitions following this equation. The calculation was carried out for each of the 128 positive frequency components of the DFT. The result is found in Fig. 4 where we have plotted ω(k) (converted to Hertz for a sample rate of 11 025) against bin number. Note that the calculated frequencies are correct for five or more bins on either side of the bin into which the frequency belongs. This point will be discussed further in the section comparing this method with the phase vocoder. The measured frequencies as determined by Eq. (4) were also printed out and were correct to the two decimal places indicated above. The solid diagonal line represents the center frequency of these DFT bins plotted against bin number. The analogous graph obtained from an exact measurement of ω based on the calculation of two successive DFT's with a time advance of one sample was identical to this graph and is not included. This means that the assumption of periodicity used in Eq. (1) is extremely good.

[...] as described adds a negligible amount to the computation time since it is only carried out for one DFT bin. Once the bin number for the initial estimate of the fundamental frequency is selected from the constant Q transform, a calculation is made to determine the corresponding bin number for the FFT. The real and imaginary parts of the FFT for this bin and those on either side of it were previously calculated, and only these three complex numbers are needed for the evaluation of Eqs. (2) and (3). These are then used in Eq. (4) with the definitions following it.

The precise frequencies were calculated for the violin and clarinet scales discussed previously. The results are given by the closed triangles in Figs. 2 and 3. Note that for the violin the precise frequency is correct in several cases where the initial estimate was off by a bin. For the clarinet in Fig. 3, the precise frequency is correct for the first three notes where the initial estimate was incorrect because the [...]

The power of this method is even more apparent when it is applied to the acoustic sounds for which it is intended. In Figs. 5 and 6 are found the frequencies (solid triangles) obtained using Eq. (4) on the output from the frequency tracker described in Sec. I. A violin is executing a glissando in Fig. 5 and vibrato in Fig. 6. For comparison we include the open squares indicating the results for the frequency tracker without the phase correction. For the glissando two points on the precise frequency curve are slightly off, but these errors are small. We are uncertain as to their origin.
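A synthetic stand-in for the glissando case may make the benefit concrete. The sketch below is an assumption-laden illustration, not the paper's experiment: the signal is a linear glide from 500 to 900 Hz over one second, the FFT bin is chosen from the known instantaneous frequency (a stand-in for the constant Q tracker's choice), and `phase_frequency` is a hypothetical helper that applies the one-sample phase advance to a Hanning-windowed bin formed from a single FFT. Frames are 256 samples long and advance by 25 ms, as in the text.

```python
import numpy as np

def phase_frequency(frame, k, sample_rate):
    """Frequency (Hz) in bin k from the one-sample phase advance of a Hanning bin."""
    N = len(frame)
    X = np.fft.fft(frame)
    idx = np.array([k - 1, k, k + 1]) % N
    Xm, X0, Xp = X[idx]
    rot = np.exp(2j * np.pi * idx / N)
    XH_now = 0.5 * (X0 - 0.5 * Xp - 0.5 * Xm)
    XH_next = 0.5 * (rot[1] * X0 - 0.5 * rot[2] * Xp - 0.5 * rot[0] * Xm)
    return np.angle(XH_next * np.conj(XH_now)) * sample_rate / (2 * np.pi)

sample_rate, N = 11025, 256
hop = int(0.025 * sample_rate)                  # 25 ms between frames
t = np.arange(sample_rate) / sample_rate        # one second of signal
inst_freq = 500.0 + 400.0 * t                   # linear glide, 500 to 900 Hz
signal = np.cos(2 * np.pi * np.cumsum(inst_freq) / sample_rate)

rows = []
for start in range(0, len(signal) - N, hop):
    frame = signal[start:start + N]
    center = inst_freq[start + N // 2]          # frequency at the frame center
    k = int(round(center * N / sample_rate))    # stand-in for the tracker's bin
    coarse = k * sample_rate / N                # bin-center frequency
    fine = phase_frequency(frame, k, sample_rate)
    rows.append((center, coarse, fine))

rows = np.array(rows)
coarse_err = np.max(np.abs(rows[:, 1] - rows[:, 0]))
fine_err = np.max(np.abs(rows[:, 2] - rows[:, 0]))
print(f"max error: bin center {coarse_err:.2f} Hz, phase method {fine_err:.2f} Hz")
```

The bin-center error is bounded by half the bin spacing (about 21.5 Hz here), whereas the phase estimate follows the glide closely because the frequency changes by far less than one bin within a frame.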
OF PHASE
For a sampled function y(x) where values are only known for integral x, the position of the "true" maximum of the function usually occurs at nonintegral values of x. One widely used means of approximating the (nonintegral) [...]

$$x_{\max} = \frac{y_{+1} - y_{-1}}{2\,(2y_0 - y_{+1} - y_{-1})}, \qquad (5)$$

where $y_0$ is the largest sample, $y_{-1}$ and $y_{+1}$ are its two neighbors, and $x_{\max}$ is measured from the position of $y_0$.

The accuracy for the quadratic fit method can be calculated exactly for an input signal consisting of a single harmonic component. For a component falling into bin $k_0$ with a frequency corresponding exactly to $k_0 + r$ where -0.5 < r < 0.5 we evaluate Eq. (2) [...] values for these Fourier coefficients, $X^{H}[k_0]$, $X^{H}[k_0+1]$ and $X^{H}[k_0-1]$. These three Fourier coefficients will be those having the maximum values and their amplitudes can be substituted in Eq. (5) to obtain the fractional value of k corresponding to the maximum of the parabola fitted through these three points. [...] from -0.5 to 0.5 and reported them in Table I along with the error in the value given by the quadratic fit, which is the difference in the columns labeled r and Q fit. We also verified these values experimentally by generating a sinusoid with the appropriate r, carrying out a Hanning-windowed FFT analysis for 10 successive frames, and then determining the position of the maximum using the quadratic fit formula of Eq. (5). The results were identical to those given in column II of Table I with zero deviation among these results for different frames.

TABLE I. Error in quadratic fit calculation. The first column is the position of the true maximum of the function relative to zero. The second column is the position of the maximum predicted by the quadratic fit in Eq. (6), and the third column is the error in the quadratic fit calculation (difference in columns one and two).

    r        Q fit     Error
  -0.500    -0.500     0.000
  -0.400    -0.357    -0.043
  -0.300    -0.247    -0.053
  -0.200    -0.156    -0.044
  -0.100    -0.076    -0.024
   0.000     0.000     0.000
   0.100     0.076     0.024
   0.200     0.156     0.044
   0.300     0.247     0.053
   0.400     0.357     0.043
   0.500     0.500     0.000

[Figure 5 panel title: VIOLIN GLISSANDO; horizontal axis: time (s)]
FIG. 5. High-resolution frequency plotted against time for a violin executing a glissando. Open squares represent the results of the fundamental [...]

[Figure 6 panel title: VIOLIN VIBRATO]
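The quadratic-fit experiment can be approximated numerically with the sketch below. It assumes the standard three-point parabola vertex formula, a 256-point Hanning-windowed FFT, and an arbitrary bin k0 = 40; the helper names are hypothetical. For each offset r it prints r, the fitted offset, and their difference, which can be compared with the tabulated values.

```python
import numpy as np

def quadratic_peak(y_minus, y_0, y_plus):
    """Offset of the parabola maximum from the center sample (three-point fit)."""
    return (y_plus - y_minus) / (2.0 * (2.0 * y_0 - y_plus - y_minus))

def hann_magnitudes(k0, r, N=256):
    """Magnitudes of bins k0-1, k0, k0+1 for a sinusoid at bin position k0 + r."""
    n = np.arange(N)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)   # Hanning window
    x = np.cos(2 * np.pi * (k0 + r) * n / N)
    spectrum = np.abs(np.fft.fft(window * x))
    return spectrum[k0 - 1], spectrum[k0], spectrum[k0 + 1]

k0 = 40                                   # arbitrary; the error does not depend on it
print("    r     Q fit    error")
for r in np.arange(-0.5, 0.51, 0.1):
    fit = quadratic_peak(*hann_magnitudes(k0, r))
    print(f"{r:+.3f}  {fit:+.3f}  {r - fit:+.3f}")
```

The fit is exact at r = 0 and at r = ±0.5, where two bins have equal magnitude, and is biased toward the bin center in between.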
It should be noted that the error is independent of the bin number $k_0$, so the fractional error which would be reported for an FFT would in fact be the error in the third column of Table I divided by $k_0 + r$.

Our calculation based on phase differences from Eq. (4) was also applied to several of these frequencies. The deviation from the correct frequency was less than 0.01% with this method. We thus conclude that, for a signal consisting of a single component, the phase method is more accurate than the quadratic fit.

We then determined the effect of "spillover" from adjacent bins by generating a sound consisting of the sum of components in exact bin positions 3.1, 6.2, and 9.3. The results for 10 frames are found in Table II.

TABLE II. Comparison of average frequencies predicted by phase difference approximation (row two) and quadratic fit (row three) with their standard deviations from 10 measurements. Row one gives the exact frequencies in the test signal.

                 Freq.     s.d.     Freq.     s.d.     Freq.     s.d.
  True          133.506            267.012            400.518
  Phase         133.505   0.654    267.009   1.277    400.514   1.036
  Quadratic     132.478   0.4155   265.138   0.771    398.214   0.151

For this case the phase method again gives a more accurate value, but the standard deviation is greater. So the confidence in a single measurement would be lower for the phase method. It should be noted that the actual error is greater than the standard deviation for the quadratic fit.

V. COMPARISON TO THE PHASE VOCODER

[...] methods of conducting phase vocoder analyses. With the filterbank approach the sinusoidal analysis functions maintain a fixed phase with respect to the signal, and one measures deviations from center frequencies. On the other hand, the phase vocoder analysis is often carried out by taking FFT's, where the analysis functions do not maintain their initial phase with respect to the input signal, but rather restart at zero for each calculation. In this case the absolute phase is measured. To compensate, the phase change corresponding to the bin center frequency is subtracted from the measured phase change, or an equivalent method, such as circularly rotating samples in the analysis window, is applied. There are, nevertheless, errors in bins far removed from the bins in which components of the signal actually fall unless the hop size is 1. These errors increase as the hop size increases so there is a tradeoff between data proliferation and accuracy.

Our method is equivalent to a phase vocoder analysis using the FFT method with a hop size of 1 sample. The advantage is that we use the approximation of Eq. (3) and do not have to perform the second FFT. Thus we measure the absolute frequency for each bin, and get this information from a single FFT. With our method we obtain the exact frequency of the source for five or so bins on either side of the bin with center frequency closest to that of the input sinusoidal component. Our method has clear advantages over the conventional phase vocoder and thus holds promise for musical synthesis as well as analysis.

VI. CONCLUSION

Our method of tracking the fundamental frequency of musical passages in real time is extremely accurate and reports frequencies to the nearest quarter tone. Our high resolution frequency determination can be used as a back end for a fundamental frequency tracker where high precision is desired. Applications range from analysis of sounds with continuous frequency variation to determination of temperament for performance studies in cognitive psychology.

ACKNOWLEDGMENTS

[...] and to Wellesley College for its generous sabbatical leave policy. She would like to thank Dan Ellis of the MIT Media Lab for invaluable and illuminating conversations. Finally she is grateful to Jim Beauchamp of the University of Illinois for many helpful e-mail discussions.

Beauchamp, J. W. (1966). "Transient Analysis of Harmonic Musical Tones by Digital Computer," Audio Eng. Soc. Preprint No. 479.
Beauchamp, J. W. (1969). "A Computer System for Time-Variant Harmonic Analysis and Synthesis of Musical Tones," in Music by Computers, edited by von Foerster and Beauchamp (Wiley, New York), pp. 19-61.
Brown, J. C. (1991). "Calculation of a constant Q spectral transform," J. Acoust. Soc. Am. 89, 425-434.
Brown, J. C. (1992). "[...] pattern recognition method," J. Acoust. Soc. Am. 92, 1394-1402.
Brown, J. C., and Puckette, M. S. (1992). "An efficient algorithm for the calculation of a constant Q transform," J. Acoust. Soc. Am. 92, 2698-2701.
Charpentier, F. J. (1986). "Pitch Detection Using the Short-Term Phase Spectrum," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (IEEE, New York), pp. 113-116.
Dolson, M. (1986). "The Phase Vocoder: A Tutorial," Comput. Music J. 10, 14-27.
Friedman (1985). "[...] An Interpretation of the Phase Structure of Speech," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (IEEE, New York), pp. 1121-1124.
Grey, J. M., and Moorer, J. A. (1977). "Perceptual evaluations of synthesized musical instrument tones," J. Acoust. Soc. Am. 62, 454-462.
Moorer, J. A. (1978). "The Use of the Phase Vocoder in Computer Music Applications," J. Audio Eng. Soc. 26, 42-45.
Oppenheim, A. V., and Lim, J. S. (1981). "The Importance of Phase in Signals," Proc. IEEE 69, 529-541.