
2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 20-23, 2013, New Paltz, NY

A PROBABILISTIC LINE SPECTRUM MODEL FOR MUSICAL INSTRUMENT SOUNDS
AND ITS APPLICATION TO PIANO TUNING ESTIMATION

François Rigaud1∗, Angélique Drémeau1, Bertrand David1 and Laurent Daudet2†

1 Institut Mines-Télécom; Télécom ParisTech; CNRS LTCI; Paris, France
2 Institut Langevin; Paris Diderot Univ.; ESPCI ParisTech; CNRS; Paris, France

∗ This work is supported by the DReaM project of the French Agence Nationale de la Recherche (ANR-09-CORD-006, under the CONTINT program).
† Also at Institut Universitaire de France.

ABSTRACT

The paper introduces a probabilistic model for the analysis of line spectra, defined here as the set of frequencies of spectral peaks with significant energy. This model is detailed in a general polyphonic audio framework and assumes that, for a time-frame of signal, the observations have been generated by a mixture of notes composed of partial and noise components. Observations corresponding to partial frequencies can provide some information on the musical instrument that generated them. In the case of piano music, the fundamental frequency and the inharmonicity coefficient are introduced as parameters for each note, and can be estimated from the line spectrum parameters by means of an Expectation-Maximization algorithm. This technique is finally applied to the unsupervised estimation of the tuning and inharmonicity along the whole compass of a piano, from the recording of a musical piece.

Index Terms— probabilistic model, EM algorithm, polyphonic piano music

1. INTRODUCTION

Most algorithms dedicated to audio applications (F0 estimation, transcription, ...) consider the whole range of audible frequencies to perform their analysis, whereas, apart from attack transients, the energy of music signals is often contained in only a few frequency components, also called partials. Thus, in a time-frame of a music signal only a few frequency bins carry information relevant for the analysis. By reducing the set of observations, i.e. by keeping only the few most significant frequency components, it can be assumed that most signal analysis tasks may still be performed. For a given frame of signal, this reduced set of observations is here called a line spectrum, a term usually used for the discrete spectrum of electromagnetic radiation of a chemical element.

Several studies have considered dealing with such line spectra to perform analysis. Among them, [1] proposes to compute tonal descriptors from the frequencies of local maxima extracted from polyphonic audio short-time spectra. In [2] a probabilistic model for multiple-F0 estimation from sets of maxima of the Short-Time Fourier Transform is introduced. It is based on a Gaussian mixture model whose means are constrained by an F0 parameter, and it is solved as a maximum likelihood problem by means of heuristics and grid search. A similar constrained mixture model is proposed in [3] to model speech spectra (along the whole frequency range) and solved using an Expectation-Maximization (EM) algorithm.

The model presented in this paper is inspired by these last two references [2, 3]. The key difference is that we here focus on piano tones, which have the well-known property of inharmonicity, which in turn influences tuning. This slight frequency stretching of the partials should allow, up to a certain point, disambiguation of harmonically-related notes. Conversely, from the set of partial frequencies, it should be possible to estimate the inharmonicity and tuning parameters of the piano. The model is first introduced in a general audio framework by considering that the frequencies corresponding to the local maxima of a spectrum have been generated by a mixture of notes, each note being composed of partial (Gaussian mixture) and noise components. In order to be applied to piano music analysis, the F0 and the inharmonicity coefficient of the notes are introduced as constraints on the means of the Gaussians, and a maximum a posteriori EM algorithm is derived to perform the estimation. It is finally applied to the unsupervised estimation of the inharmonicity and tuning curves along the whole compass of a piano, from isolated note recordings, and then from a polyphonic piece.

2. MODEL AND PROBLEM FORMULATION

2.1. Observations

In time-frequency representations of music signals, the information contained in two consecutive frames is often highly redundant. This suggests that, in order to retrieve the tuning of a given instrument from a whole piece of solo music, a few independent frames localized after note onset instants should contain all the information that is necessary for processing. These time-frames are indexed by t ∈ {1...T} in the following. In order to extract significant peaks (i.e. peaks containing energy) from the magnitude spectra, a noise level estimation based on median filtering (cf. appendix of [4]) is first performed. Above this noise level, local maxima (defined as having a greater magnitude than the K left and K right neighboring frequency bins) are extracted. The frequency of each maximum picked in a frame t is denoted by y_ti, i ∈ {1...I_t}. The set of observations for each frame is then denoted by y_t (a vector of length I_t), and for the whole piece of music by Y = {y_t, t ∈ {1...T}}. In the following, variables denoted by lower case, bold lower case and upper case letters respectively correspond to scalars, vectors and sets of vectors.
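As an illustration of this peak-picking step, a minimal NumPy/SciPy sketch is given below. The Hann window, the median-filter length and the function name are our own choices (the paper only specifies the noise-level estimation of [4] and the K-neighbor criterion), and the default FFT size and K mirror the settings reported later in Sec. 4; the snippet is an assumption-laden outline, not the authors' implementation.

```python
import numpy as np
from scipy.signal import medfilt

def extract_line_spectrum(frame, fs, n_fft=2**15, K=20, medfilt_bins=201):
    """Return the frequencies y_t of significant spectral peaks in one frame.

    A bin is kept as a peak if its magnitude exceeds a median-filtered noise
    level and is larger than that of its K left and K right neighboring bins.
    """
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

    # Rough noise level obtained by median filtering the magnitude spectrum
    noise_level = medfilt(spec, kernel_size=medfilt_bins)

    peaks = []
    for i in range(K, len(spec) - K):
        if spec[i] <= noise_level[i]:
            continue
        neighbours = np.r_[spec[i - K:i], spec[i + 1:i + K + 1]]
        if spec[i] > neighbours.max():
            peaks.append(freqs[i])
    return np.asarray(peaks)  # the observation vector y_t (length I_t)
```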


2.2. Probabilistic Model

If a note of music, indexed by r ∈ {1...R}, is present in a time-frame, most of the extracted local maxima should correspond to partials related by a particular structure (harmonic or inharmonic, for instance). These partial frequencies constitute the set of parameters of the proposed model. It is denoted by θ and, in a general context (no information about the harmonicity or inharmonicity of the sounds), can be expressed as θ = {f_nr | ∀n ∈ {1...N_r}, r ∈ {1...R}}, where n is the rank of the partial and N_r the maximal rank considered for the note r.

In order to link the observations to the set of parameters θ, the following hidden random variables are introduced:

• q_t ∈ {1...R}, corresponding to the note that could have generated the observations y_t.

• C_t = [c_tir], (i,r) ∈ {1...I_t}×{1...R}, gathering Bernoulli variables specifying the nature of the observation y_ti, for each note r. An observation is considered as belonging to a partial of note r if c_tir = 1, or to noise (non-sinusoidal component or partial corresponding to another note) if c_tir = 0.

• P_t = [p_tir], (i,r) ∈ {1...I_t}×{1...R}, corresponding to the rank of the partial n of the note r that could have generated the observation y_ti, provided that c_tir = 1.

Based on these definitions, the probability that an observation y_ti has been generated by a note r can be expressed as:

\[
p(y_{ti}|q_t{=}r;\theta) = p(y_{ti}|c_{tir}{=}0, q_t{=}r)\, p(c_{tir}{=}0|q_t{=}r)
+ \sum_n p(y_{ti}|p_{tir}{=}n, c_{tir}{=}1, q_t{=}r;\theta)\, p(p_{tir}{=}n|c_{tir}{=}1, q_t{=}r)\, p(c_{tir}{=}1|q_t{=}r). \tag{1}
\]

It is chosen that the observations related to the partial n of a note r should be located around the frequency f_nr, according to a Gaussian distribution of mean f_nr and variance σ_r² (a fixed parameter):

\[
p(y_{ti}|p_{tir}{=}n, c_{tir}{=}1, q_t{=}r;\theta) = \mathcal{N}(f_{nr}, \sigma_r^2), \tag{2}
\]
\[
p(p_{tir}{=}n|c_{tir}{=}1, q_t{=}r) = 1/N_r. \tag{3}
\]

On the other hand, observations that are related to noise are chosen to be uniformly distributed along the frequency axis (with maximal frequency F):

\[
p(y_{ti}|c_{tir}{=}0, q_t{=}r) = 1/F. \tag{4}
\]

Then, the probability of obtaining a noise or a partial observation knowing the note r is chosen so that:

· if I_t > N_r:
\[
p(c_{tir}|q_t{=}r) = \begin{cases} (I_t - N_r)/I_t & \text{if } c_{tir}=0,\\ N_r/I_t & \text{if } c_{tir}=1.\end{cases} \tag{5}
\]
This should approximately correspond to the proportion of observations associated with the noise and partial classes for each note.

· if I_t ≤ N_r:
\[
p(c_{tir}|q_t{=}r) = \begin{cases} 1-\epsilon & \text{if } c_{tir}=0,\\ \epsilon & \text{if } c_{tir}=1,\end{cases} \tag{6}
\]
with ε ≪ 1 (set to 10⁻⁵ in the presented results). This latter expression means that, for a given note r at a frame t, every observation should be mainly considered as noise if N_r (its number of partials) is greater than the number of observations. This situation may occur, for instance, in a frame in which a single note from the high treble range is played. In this case, only a few local maxima are extracted, and the lowest notes, composed of many more partials, should not be considered as present.

Finally, with no prior information, it is chosen that

\[
p(q_t{=}r) = 1/R. \tag{7}
\]
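The per-observation likelihood of Eq. (1)-(7) can be sketched numerically as follows; the function name and the vectorized layout are ours, and the partial frequencies f_nr of the note are assumed to be given.

```python
import numpy as np
from scipy.stats import norm

def note_likelihood(y_t, f_r, sigma_r, F, eps=1e-5):
    """Evaluate p(y_ti | q_t = r; theta) of Eq. (1) for every observation in y_t.

    y_t     : array of peak frequencies observed in the frame (length I_t)
    f_r     : array of partial frequencies f_nr of note r (length N_r)
    sigma_r : standard deviation of the partial components (Eq. (2))
    F       : maximal frequency, support of the uniform noise model (Eq. (4))
    eps     : epsilon of Eq. (6), used when I_t <= N_r
    """
    I_t, N_r = len(y_t), len(f_r)

    # Priors on the noise / partial classes, Eq. (5)-(6)
    if I_t > N_r:
        p_noise, p_partial = (I_t - N_r) / I_t, N_r / I_t
    else:
        p_noise, p_partial = 1.0 - eps, eps

    # Gaussian mixture over the partial ranks, Eq. (2)-(3)
    gauss = norm.pdf(y_t[:, None], loc=f_r[None, :], scale=sigma_r)  # (I_t, N_r)
    partial_term = p_partial * gauss.mean(axis=1)  # (1/N_r) * sum over n of N(f_nr, sigma_r^2)

    # Uniform noise component, Eq. (4)
    noise_term = p_noise / F

    return noise_term + partial_term  # Eq. (1), one value per observation
```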
2.3. Estimation problem

In order to estimate the parameters of interest θ, it is proposed to solve the following maximum a posteriori estimation problem:

\[
(\theta^\star, \{C_t^\star\}_t, \{P_t^\star\}_t) = \underset{\theta,\{C_t\}_t,\{P_t\}_t}{\operatorname{argmax}} \sum_t \log p(\mathbf{y}_t, C_t, P_t; \theta), \tag{8}
\]

where

\[
p(\mathbf{y}_t, C_t, P_t; \theta) = \sum_r p(\mathbf{y}_t, C_t, P_t, q_t{=}r; \theta). \tag{9}
\]

Solving problem (8) corresponds to the estimation of θ, jointly with a clustering of each observation into noise or partial classes for each note. Note that the sum over t in Eq. (8) arises from the time-frame independence assumption (justified in Sec. 2.1).

2.4. Application to piano music

The model presented in Sec. 2.2 is general, since no particular structure has been set on the partial frequencies. In the case of piano music, the tones are inharmonic, and the partial frequencies related to the transverse vibrations of the (stiff) strings can be modeled as:

\[
f_{nr} = n F_{0r} \sqrt{1 + B_r n^2}, \quad n \in \{1...N_r\}. \tag{10}
\]

F_0r corresponds to the fundamental frequency (a theoretical value that does not appear as a peak in the spectrum) and B_r to the inharmonicity coefficient. These parameters vary along the compass and depend on the piano type [5]. Thus, for applications to piano music, the set of parameters can be rewritten as θ = {F_0r, B_r, ∀r ∈ {1...R}}.
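Eq. (10) is straightforward to evaluate; the following sketch (with illustrative, not measured, values of F0 and B) may help fix the notation.

```python
import numpy as np

def partial_frequencies(F0, B, N):
    """Partial frequencies of Eq. (10): f_n = n * F0 * sqrt(1 + B * n**2)."""
    n = np.arange(1, N + 1)
    return n * F0 * np.sqrt(1.0 + B * n**2)

# Example: a note around A2 (F0 ~ 110 Hz) with an illustrative inharmonicity
# coefficient; the partials are slightly stretched above the harmonic series.
print(partial_frequencies(110.0, 1e-4, 5))
```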
3. OPTIMIZATION

Problem (8) usually has no closed-form solution but can be solved in an iterative way by means of an Expectation-Maximization (EM) algorithm [6]. The auxiliary function at iteration (k+1) is given by

\[
Q(\theta, \{C_t\}_t, \{P_t\}_t \,|\, \theta^{(k)}, \{C_t^{(k)}\}_t, \{P_t^{(k)}\}_t) = \sum_t \sum_r \sum_i \omega_{rt} \cdot \log p(y_{ti}, c_{tir}, p_{tir}, q_t{=}r; \theta) \tag{11}
\]

where

\[
\omega_{rt} = p(q_t{=}r \,|\, \mathbf{y}_t, \{C_t^{(k)}\}_t, \{P_t^{(k)}\}_t; \theta^{(k)}) \tag{12}
\]

is computed at the E-step knowing the values of the parameters at iteration (k). At the M-step, θ, {C_t}_t, {P_t}_t are estimated by maximizing Eq. (11). Note that the sum over i in Eq. (11) is obtained under the assumption that, in each frame, the y_ti are independent.

3.1. Expectation

According to Eq. (12) and the model of Eq. (1)-(7),

\[
\begin{aligned}
\omega_{rt} &\propto \prod_{i=1}^{I_t} p(y_{ti}, q_t{=}r, c_{tir}^{(k)}, p_{tir}^{(k)}; \theta^{(k)}) \\
&\propto p(q_t{=}r) \cdot \prod_{i / c_{tir}^{(k)}=0} p(y_{ti}|q_t{=}r, c_{tir}^{(k)})\, p(c_{tir}^{(k)}|q_t{=}r) \cdot \prod_{i / c_{tir}^{(k)}=1} p(y_{ti}|q_t{=}r, c_{tir}^{(k)}, p_{tir}^{(k)}; \theta^{(k)})\, p(p_{tir}^{(k)}|c_{tir}^{(k)}, q_t{=}r)\, p(c_{tir}^{(k)}|q_t{=}r),
\end{aligned} \tag{13}
\]

normalized so that \(\sum_{r=1}^{R} \omega_{rt} = 1\) for each frame t.
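A possible log-domain implementation of the E-step of Eq. (12)-(13) is sketched below for one frame; the data layout (per-frame arrays holding the current classification) and the function name are our assumptions.

```python
import numpy as np
from scipy.stats import norm

def e_step_responsibilities(y_t, partials, c_t, p_t, sigma, F, R, eps=1e-5):
    """Responsibilities omega_rt of Eq. (12)-(13) for one frame, computed in the log domain.

    y_t      : observed peak frequencies in the frame (length I_t)
    partials : list of length R; partials[r] is the array of current f_nr for note r
    c_t      : (I_t, R) current noise/partial classification c_tir in {0, 1}
    p_t      : (I_t, R) current partial ranks p_tir (1-based, used when c_tir == 1)
    """
    I_t = len(y_t)
    log_w = np.full(R, -np.log(R))                        # log p(q_t = r) = log(1/R), Eq. (7)
    for r in range(R):
        f_r = partials[r]
        N_r = len(f_r)
        if I_t > N_r:
            lp_c = np.log([(I_t - N_r) / I_t, N_r / I_t])  # Eq. (5)
        else:
            lp_c = np.log([1.0 - eps, eps])                # Eq. (6)
        for i in range(I_t):
            if c_t[i, r] == 0:                             # noise observation, Eq. (4)
                log_w[r] += -np.log(F) + lp_c[0]
            else:                                          # partial of rank p_tir, Eq. (2)-(3)
                mu = f_r[p_t[i, r] - 1]
                log_w[r] += (norm.logpdf(y_t[i], loc=mu, scale=sigma)
                             - np.log(N_r) + lp_c[1])
    log_w -= np.max(log_w)                                 # stabilize before exponentiation
    w = np.exp(log_w)
    return w / w.sum()                                     # sum_r omega_rt = 1
```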

3.2. Maximization

The M-step is performed by a sequential maximization of Eq. (11):

• First, estimate, ∀ t, i and q_t = r, the variables c_tir and p_tir. As mentioned in Sec. 2.3, this corresponds to a classification step, where each observation is associated, for each note, with the noise class (c_tir = 0) or with a partial class of a given rank (c_tir = 1 and p_tir ∈ {1...N_r}). This step is equivalent to a maximization of log p(y_ti, c_tir, p_tir | q_t = r; θ) which, according to Eq. (1)-(7), can be expressed as:

\[
(c_{tir}^{(k+1)}, p_{tir}^{(k+1)}) = \underset{(\{0,1\},\, n)}{\operatorname{argmax}}
\begin{cases}
-\log F + \log p(c_{tir}{=}0|q_t{=}r), \\[4pt]
-\dfrac{(y_{ti} - f_{nr})^2}{2\sigma_r^2} - \log\!\left(N_r \sqrt{2\pi}\,\sigma_r\right) + \log p(c_{tir}{=}1|q_t{=}r).
\end{cases} \tag{14}
\]

• Then, the estimation of θ is equivalent to (∀ r ∈ {1...R})

\[
(F_{0r}^{(k+1)}, B_r^{(k+1)}) = \underset{F_{0r}, B_r}{\operatorname{argmax}} \sum_t \omega_{rt} \sum_{i / c_{tir}^{(k+1)}=1} \Big[ \log p(c_{tir}^{(k+1)}{=}1|q_t{=}r) - \Big( y_{ti} - p_{tir}^{(k+1)} F_{0r} \sqrt{1 + B_r\, p_{tir}^{(k+1)2}} \Big)^2 \Big]. \tag{15}
\]

For F_0r, canceling the partial derivative of Eq. (15) leads to the following update rule:

\[
F_{0r}^{(k+1)} = \frac{\displaystyle\sum_t \omega_{rt} \sum_{i / c_{tir}^{(k+1)}=1} y_{ti}\, p_{tir}^{(k+1)} \sqrt{1 + B_r\, p_{tir}^{(k+1)2}}}{\displaystyle\sum_t \omega_{rt} \sum_{i / c_{tir}^{(k+1)}=1} p_{tir}^{(k+1)2} \left(1 + B_r\, p_{tir}^{(k+1)2}\right)}. \tag{16}
\]

For B_r, no closed-form solution can be obtained from the partial derivative of Eq. (15). The maximization is thus performed by means of an algorithm based on the Nelder-Mead simplex method.
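The two M-step updates can be sketched as follows. The classification of Eq. (14) and the F0 update of Eq. (16) are direct; for B, we illustrate one possible organisation in which F0 is re-solved in closed form inside a Nelder-Mead search on B (the paper only states that B is optimized with a Nelder-Mead simplex method, so this particular nesting, like the function names, is our assumption).

```python
import numpy as np
from scipy.optimize import minimize

def classify_observations(y_t, f_r, sigma, F, lp_c0, lp_c1):
    """Eq. (14): for each observation, pick noise (c=0) or the best partial rank (c=1, n).

    lp_c0, lp_c1 : log p(c_tir = 0 | q_t = r) and log p(c_tir = 1 | q_t = r), Eq. (5)-(6)
    """
    N_r = len(f_r)
    noise_score = -np.log(F) + lp_c0
    d2 = (y_t[:, None] - f_r[None, :]) ** 2               # (I_t, N_r) squared deviations
    partial_scores = (-d2 / (2 * sigma**2)
                      - np.log(N_r * np.sqrt(2 * np.pi) * sigma) + lp_c1)
    best_n = partial_scores.argmax(axis=1)
    best_score = partial_scores.max(axis=1)
    c = (best_score > noise_score).astype(int)            # c_tir
    p = best_n + 1                                         # p_tir (1-based rank)
    return c, p

def update_F0(y, p, w, B):
    """Closed-form F0 update of Eq. (16) for one note.

    y, p : partial-class observations and their ranks, concatenated over all frames
    w    : the omega_rt weight attached to each of those observations
    """
    stretch = np.sqrt(1.0 + B * p**2)
    return np.sum(w * y * p * stretch) / np.sum(w * p**2 * (1.0 + B * p**2))

def update_B(y, p, w, B_init):
    """Nelder-Mead search on B for the fit term of Eq. (15)."""
    def residual(B):
        B = float(np.atleast_1d(B)[0])
        if B < 0:                                          # keep the search in the valid region
            return np.inf
        F0 = update_F0(y, p, w, B)
        return np.sum(w * (y - p * F0 * np.sqrt(1.0 + B * p**2)) ** 2)
    res = minimize(residual, x0=[B_init], method='Nelder-Mead')
    return float(res.x[0])
```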
3.3. Practical considerations

The cost function (cf. the maximization of Eq. (8)) is non-convex with respect to the (B_r, F_0r) parameters. In order to prevent the algorithm from converging towards a local maximum, special care must be taken with the initialization.

First, the initialization of (B_r, F_0r) uses a mean model of inharmonicity and tuning [5] based on piano string design and tuning rule invariants. This initialization is depicted as gray lines in Fig. 1(b) and 1(c) of Sec. 4. Moreover, to avoid situations where the algorithm optimizes the parameters of a note in order to fit the data corresponding to another note (e.g. increasing F0 by one semitone), (B_r, F_0r) are prevented from being updated beyond limit curves. For B, these are depicted as gray dashed lines in Fig. 1(b). The limit curves for F0 are set to +/- 40 cents around the initialization.

Since the deviation of the partial frequencies increases with the rank of the partial (cf. Eq. (10)), the higher the rank of the partial, the less precise its initialization. It is therefore proposed to initialize the algorithm with a few partials for each note (about 10 in the bass range down to 3 in the treble range) and to add a new partial every 10 iterations (a number determined empirically), initializing its frequency with the current (B_r, F_0r) estimates.
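A schematic version of these two safeguards (partial-addition schedule and limit clamping) could look like the following; the function names and the cent-to-frequency-ratio conversion for the F0 limits are ours.

```python
import numpy as np

def scheduled_partial_count(iteration, n_start, n_max=40, every=10):
    """Number of partials used at a given EM iteration: start with a few,
    add one every `every` iterations (Sec. 3.3), never exceeding n_max."""
    return min(n_start + iteration // every, n_max)

def clamp_parameters(F0, B, F0_init, B_min, B_max):
    """Keep (B, F0) inside the limit curves: B within [B_min, B_max],
    F0 within +/- 40 cents of its initialization (1 cent = 2**(1/1200))."""
    F0_lo, F0_hi = F0_init * 2 ** (-40 / 1200), F0_init * 2 ** (40 / 1200)
    return np.clip(F0, F0_lo, F0_hi), np.clip(B, B_min, B_max)
```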
4. APPLICATIONS TO PIANO TUNING ESTIMATION

It is proposed in this section to apply the algorithm to the estimation of the (B_r, F_0r) parameters, in an unsupervised way (i.e. without knowing which notes are played), from isolated note recordings covering the whole compass of a piano and from polyphonic pieces. The recordings are taken from the SptkBGCl grand piano synthesizer (using high quality samples) of the MAPS database¹.

The observation set is built according to the description given in Sec. 2.1. The time-frames are extracted after note onset instants and their length is set to 500 ms in order to have a sufficient spectral resolution. The FFT is computed on 2^15 bins and the maxima are extracted by setting K = 20. Note that, for the presented results, the knowledge of the note onset instants is taken from the ground truth (MIDI-aligned files). For a completely blind approach, an onset detection algorithm should be run first. This should not significantly affect the presented results, since onset detection algorithms usually perform well on percussive tones. The parameter σ_r is set to 2 Hz for all the notes and the maximal value of N_r is set to 40.

¹ http://www.tsi.telecom-paristech.fr/aao/en/category/database/

4.1. Estimation from isolated notes

The ability of the model/algorithm to provide correct estimates of (B_r, F_0r) over the whole piano compass is investigated here. The set of observations is composed of 88 frames (jointly processed), one for each note of the piano (from A0 to C8, with MIDI indices in [21, 108]). R is set equal to 88 in order to consider all notes. The results are presented in Fig. 1. Subplot (a) depicts the matrix ω_rt in linear and decimal log scale (the x and y axes respectively correspond to the frame index t and the note r in MIDI index). The diagonal structure can be observed up to frame t = 65: the algorithm detected the correct note in each frame, up to note C♯6 (MIDI index 85). Above that, the detection is not correct and leads to bad estimates of B_r (subplot (b)) and F_0r (subplot (c)). For instance, above MIDI note 97, the (B_r, F_0r) parameters stayed fixed at their initial values. These difficulties in detecting and estimating the parameters of notes in the high treble range are common for piano analysis algorithms [5]: in this range, notes are produced by 3 coupled strings whose partials do not fit well into the inharmonicity model of Eq. (10). The consistency of the presented results may be qualitatively evaluated by referring to the curves of (B, F0) obtained on the same piano by a supervised method, as depicted in Fig. 5 of [5].

4.2. Estimation from musical pieces

Finally, the algorithm is applied to an excerpt of polyphonic music (25 s of the MAPS MUS-muss 3 SptkBGCl file) containing notes in the range D♯1-F♯6 (MIDI 27-90), from which 46 frames are extracted. 66 notes, from A0 to C7 (MIDI 21-96), are considered in the model. This corresponds to a reduction of one octave in the high treble range, where the notes, rarely used in a musical context, cannot be properly processed, as seen in Sec. 4.1.

The proposed application is here the learning of the inharmonicity and tuning curves along the whole compass of a piano from a generic polyphonic piano recording. Since the 88 notes are never all present in a single recording, we estimate (B, F0) for the notes present in the recording and, from the most reliable estimates, apply an interpolation based on physics/tuning considerations [5]. In order to perform this complex task in an unsupervised way, a heuristic is added to the optimization and a post-processing is performed. At each iteration of the optimization, a threshold is applied to ω_rt in order to limit the degree of polyphony to 10 notes for each frame t. Once the optimization is performed, the most reliable notes are kept according to two criteria.

First, a threshold is applied to the matrix ω_rt so that elements having values lower than 10⁻³ are set to zero; notes that are never activated along the whole set of frames are then rejected. Second, notes whose B estimates are stuck to the limits (cf. gray dashed lines in Fig. 1) are rejected.
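The polyphony-limiting heuristic and the note-selection criteria described above can be sketched as follows; the array shapes, function names and the way the B limits are passed in are our assumptions.

```python
import numpy as np

def limit_polyphony(omega_t, max_notes=10):
    """Heuristic of Sec. 4.2: keep only the max_notes largest omega_rt in a frame."""
    keep = np.argsort(omega_t)[-max_notes:]
    out = np.zeros_like(omega_t)
    out[keep] = omega_t[keep]
    return out

def select_reliable_notes(omega, B, B_lim_low, B_lim_high, thresh=1e-3):
    """Post-processing of Sec. 4.2: keep notes that are activated at least once
    after thresholding omega_rt and whose B estimate is not stuck on a limit curve.

    omega : (R, T) matrix of responsibilities
    B     : (R,) estimated inharmonicity coefficients
    """
    omega = np.where(omega < thresh, 0.0, omega)       # threshold the activations
    activated = omega.sum(axis=1) > 0                  # note used in at least one frame
    free = (B > B_lim_low) & (B < B_lim_high)          # B not stuck on a limit curve
    return np.flatnonzero(activated & free)            # indices of the reliable notes
```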
[Figure 1: Analysis on the whole compass from isolated note recordings. a) ω_rt in linear (left) and log10 (right) scale. b) B in log scale and c) F0 as deviation from Equal Temperament (ET) in cents, along the whole compass. (B, F0) estimates are depicted as black '+' markers and their initialization as gray lines. The limits for the estimation of B are depicted as gray dashed lines.]

[Figure 2: Piano tuning estimation along the whole compass from a piece of music. a) Notes detected by the algorithm and ground truth. b) B in log scale. c) F0 as deviation from ET in cents. (B, F0) estimates are depicted as black '+' markers and compared to the isolated note estimates (gray lines, obtained in Fig. 1). The interpolated curves (indexed by WC) are depicted as black dashed lines.]

Subplot 2(a) depicts the result of the note selection (notes having been detected at least once) for the considered piece of music. A frame-wise evaluation (against the aligned MIDI) returned a precision of 86.4% and a recall of 11.6%, all notes detected up to MIDI index 73 corresponding to true positives, and those above to false positives, all occurring in a single frame. It can be seen in subplots (b) and (c) that most of the (B, F0) estimates ('+' markers) corresponding to notes actually present are consistent with those obtained from the single-note estimation (gray lines). Above MIDI index 73, the detected notes correspond to false positives and logically lead to bad estimates of (B, F0). Finally, the piano tuning model [5] is applied to interpolate the (B, F0) curves along the whole compass (black dashed lines, indexed by WC), giving a qualitative agreement with the reference measurements. Note that the bad estimates of notes above MIDI index 73 do not disturb the whole-compass model estimation. Further work will address the quantitative evaluation of (B, F0) estimation from synthetic signals and from real piano recordings (for which the reference has to be extracted manually [5]).
5. CONCLUSION

A probabilistic line spectrum model and its optimization algorithm have been presented in this paper. To the best of our knowledge, this is the only unsupervised estimation of piano inharmonicity and tuning over the whole compass from a generic extract of polyphonic piano music. Interestingly, for this task a perfect transcription of the music does not seem necessary: only a few reliable notes may be sufficient. An extension of this model to piano transcription could nevertheless be a natural next step, but would require a more complex model taking into account both temporal dependencies between frames and spectral envelopes.

6. REFERENCES

[1] E. Gómez, "Tonal description of polyphonic audio for music content processing," INFORMS Journal on Computing, Special Cluster on Computation in Music, vol. 18, pp. 294-304, 2006.

[2] B. Doval and X. Rodet, "Estimation of fundamental frequency of musical sound signals," in Proc. ICASSP, April 1991.

[3] H. Kameoka, T. Nishimoto, and S. Sagayama, "Multi-pitch detection algorithm using constrained Gaussian mixture model and information criterion for simultaneous speech," in Proc. Speech Prosody (SP2004), March 2004, pp. 533-536.

[4] F. Rigaud, B. David, and L. Daudet, "A parametric model of piano tuning," in Proc. DAFx'11, September 2011.

[5] F. Rigaud, B. David, and L. Daudet, "A parametric model and estimation techniques for the inharmonicity and tuning of the piano," J. of the Acoustical Society of America, vol. 133, no. 5, pp. 3107-3118, May 2013.

[6] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society (B), vol. 39, no. 1, 1977.
