Dragutin Hrenek, Nenad Mikša, Robert Perica, Pavle Prentašić and Boris Trubić
University of Zagreb, Faculty of Electrical Engineering and Computing
Unska 3, 10000 Zagreb, Croatia
E-mail: {dragutin.hrenek,nenad.miksa,robert.perica,pavle.prentasic,boris.trubic}@fer.hr
Abstract: In this paper we describe a dance classification
system for compositions written in MIDI format. The system
recognizes the following dances: tango, polka, mazurka, waltz,
cha-cha-cha and march. The rhythmic structure of a dance is a
finite sequence of notes of specified durations that repeats
throughout the composition, so we hypothesise that the
probability of occurrence of a given note duration depends on
the duration of the note before it. Hence the classifier is
implemented using Hidden Markov Models. The models
are used in two basic forms: the first assumes discrete note
durations, and the second assumes that note durations follow a
normal distribution. The system was tested using examples
generated from dance prototypes with added Gaussian noise, as well as
human-played examples. The results obtained with both kinds
of examples are comparable. The system was implemented using
the Matlab programming package.
I. INTRODUCTION
Upon hearing a certain sequence of notes or rhythm, a dance
expert or even a dance enthusiast immediately thinks of the
type of dance or movement that would best fit the music
heard. Thus, he or she easily recognizes the type of dance or
music that is being played. A computer cannot do the
same with such ease, as it is unable to focus on a specific
musical instrument in the audio recording.
The MIDI (Musical Instrument Digital Interface) format is commonly
used in music production, alongside mp3, wave and
similar formats. MIDI is a protocol by which a computer
communicates with external devices, such as keyboards.
The protocol is based on the exchange of messages between the
device and the computer. Those messages can be saved in a file
and interpreted later as audio or as musical notation. The
protocol is a standard used across musical instruments
and music software, but most
devices and software do not follow the protocol specification
exactly. Therefore, it is common that
notation written in one program and saved in MIDI format
corresponds poorly to the source notation when it is opened
in another program. The problem lies in the fact that
each note can be written in MIDI format in various ways, which
increases the possibility of misinterpretation of the recording.
One of the most common notes is the quarter note. In a
4/4 measure it is represented by one beat. For example, if
the tempo is 120 beats per minute and the measure is 4/4, then
there should be 120 quarter notes in one minute of recording.
Thus, every quarter
note should last exactly half a second. But this holds only on
average. Let us assume that each quarter note lasts 100 ticks
of the clock. Then each quaver should last 50 ticks and a dotted
quaver should last 75 ticks, since a dot increases a note's
duration by 50%. For example, a dotted quarter note lasts as
long as a quarter note and a quaver together. If the music
is played by a human, then the quarter note lasts 100 ticks only
on average; it can last a bit more or less, e.g. 102 ticks
or 85 ticks, depending on the melody phrasing and other
factors. Music notation software often adds noise to the
durations of notes when saving in MIDI format in order
to make the recording sound as if it had
been played by a human. This makes the correct interpretation
of a note difficult for the computer: for example, a note
that lasts 85 ticks is closer to a dotted quaver (75 ticks) than
to a quarter note (100 ticks).
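The arithmetic above can be checked with a short sketch. It assumes, as in the text, that a quarter note lasts 100 ticks; the names `QUARTER_TICKS`, `NOMINAL_TICKS`, `nearest_nominal` and `seconds_per_quarter` are illustrative and not part of the described system.

```python
# Nominal note durations in ticks, assuming a quarter note lasts 100 ticks
# (the convention used in the text). All names here are illustrative.
QUARTER_TICKS = 100

NOMINAL_TICKS = {
    "quaver": QUARTER_TICKS / 2,                 # half of a quarter note
    "dotted quaver": QUARTER_TICKS / 2 * 1.5,    # a dot adds 50%
    "quarter note": QUARTER_TICKS,
    "dotted quarter note": QUARTER_TICKS * 1.5,  # quarter note + quaver
}


def nearest_nominal(ticks):
    """Return the note name whose nominal duration is closest to `ticks`."""
    return min(NOMINAL_TICKS, key=lambda name: abs(NOMINAL_TICKS[name] - ticks))


def seconds_per_quarter(tempo_bpm):
    """Duration of one quarter-note beat in seconds at the given tempo."""
    return 60.0 / tempo_bpm


if __name__ == "__main__":
    print(seconds_per_quarter(120))  # length of a quarter note at tempo 120
    print(nearest_nominal(102))      # a slightly long quarter note
    print(nearest_nominal(85))       # a short quarter note is misread
```

A naive nearest-duration rule therefore misreads a quarter note played at 85 ticks, which is the ambiguity the rest of the paper has to cope with.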
Thus, a quarter note is often not a real quarter note.
This is the reason for inconsistent interpretation of notes among
different programs. Because of this problem, classification of
dances by rhythmic patterns obtained from MIDI files is a
challenging problem. In this paper
we present methods that enable the computer to recognize
dances based on human-labeled examples.
The next section gives an overview of previous work
on the described problem. The third section
describes a method for classification of musical pieces with
Hidden Markov Models. The fourth section describes the
results of classification. The fifth section concludes the paper
and discusses aspects of future work.
II. PREVIOUS WORK
The described problem is closely related to the problem of
detecting the rhythmic structure of a musical piece. Takeda
et al. define the problem as a search for a sequence of states
in a probabilistic model [1]. Since the states are represented
with Hidden Markov Models, the most probable sequence of
states can be found with the well-known Viterbi algorithm [2].
Therefore, the rhythmic structure is determined by the most
probable sequence of states found by the Viterbi algorithm for
the given sequence of observations. This method is good for
finding a specific rhythmic structure, but it is impractical for
classification of rhythmic structures.
In [3], a system for extraction of musical features from
MIDI recordings is described. The system consists
of several subsystems that carry out the following tasks:
identifying basic musical objects (notes, pauses, chords, etc.),
searching for the accent on each musical object, rhythm
recognition, rhythm tracking and note discretization. The rhythmic
structure of the piece is recognized by looking at a time
interval that contains a certain number of notes. This time
interval is determined in advance for each potential rhythmic
structure that is being recognized. The actual notes in that
interval are then compared to the expected notes and
classification is performed. This method is not practical for
solving our problem as it does not give good results.
In [4], methods for note duration discretization and methods
for detection and tracking of rhythm are presented. The rhythm
detection in that paper is based on Hidden Markov Models in
such a way that each state of the model represents the moment
at which a note has been played. This enables the modeling
of different moments at which a note can appear. This method
is very useful for converting MIDI recordings into printable
musical notation.
III. METHOD DESCRIPTION
Hidden Markov Models (HMMs) are commonly used
for solving pattern recognition problems in which the patterns
are time-dependent signals, as for example in speech
recognition [2]. MIDI signals are time-dependent signals and
they provide a more abstract way of representing music in
a computer. It is much easier to extract note characteristics from
a MIDI recording than from mp3 or wave. Hence, we consider
HMMs a good choice for classification of
musical pieces recorded in MIDI format.
The idea behind HMMs assumes the existence of a set
of states $Q = \{q_i\}_{i=1}^{N}$, where $N$ is the number of states.
For each state we define the probabilities of transition from the
current state into all other states and the probability of staying
in the current state. Furthermore, for each state we define its a
priori probability (prior), i.e. the probability that the system
will start in this state. Besides the set of states, there exists a
set of possible outputs of the system $V = \{v_j\}_{j=1}^{M}$, where $M$
is the number of possible outputs. For each state of the system, we
define the probability that the system will generate a certain
output while being in that state. All this can be formally
written in the following way: a Hidden Markov Model is a
tuple
$$\lambda = (A, B, \pi)$$
where $A$ is the transition probability matrix, $B$ is the output
probability matrix and $\pi$ is the vector of priors. The elements of
the matrix $A$ are $a_{ij}$ and represent the probability of transition
from state $i$ to state $j$, i.e.
$$a_{ij} = P(q_{t+1} = j \mid q_t = i)$$
The elements of the matrix $B$ are $b_{ij}$ and represent the
probability that the output $j$ will be generated while the system
is in state $i$, i.e.
$$b_{ij} = P(\text{output} = v_j \mid q_t = i)$$
The elements of the vector $\pi$ are $\pi_i$ and represent the
probability that the system will start in state $i$, i.e.
$$\pi_i = P(q_1 = i)$$
Given this definition of an HMM, it is convenient to
represent it as a directed graph. Vertices of the graph
represent the states of the HMM and the outputs of the system,
while the edges represent possible transitions between states
and possible outputs of the system for each state. The weights
of the edges represent probabilities. An example of an HMM
is shown in Figure 1.
Figure 1. An example of a Hidden Markov Model. X represents the states,
Y represents the possible outputs of the system, a represents the transition
probabilities and b represents the probabilities of outputs in each state.
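To make the tuple of transition matrix, output matrix and priors concrete, the following minimal sketch stores a toy discrete-output HMM as plain Python lists, checks that every row is a probability distribution, and draws a sequence from it. The numbers, the sizes (2 states, 3 outputs) and the function names are made up for illustration and are not taken from the system described here.

```python
import random

# Toy HMM with N = 2 states and M = 3 output symbols (illustrative values).
A = [[0.7, 0.3],
     [0.4, 0.6]]        # A[i][j] = P(q_{t+1} = j | q_t = i)
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]   # B[i][j] = P(output = v_j | q_t = i)
pi = [0.6, 0.4]         # pi[i]   = P(q_1 = i)


def is_distribution(p, tol=1e-9):
    """True if all entries are non-negative and sum to 1 within `tol`."""
    return all(x >= 0 for x in p) and abs(sum(p) - 1.0) < tol


def is_valid_hmm(A, B, pi):
    """Check the shapes and the row-stochastic property of (A, B, pi)."""
    n = len(pi)
    return (is_distribution(pi)
            and len(A) == n and len(B) == n
            and all(len(row) == n and is_distribution(row) for row in A)
            and all(is_distribution(row) for row in B))


def sample(A, B, pi, length, rng):
    """Draw a state sequence and an output sequence of the given length."""
    states, outputs = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(length):
        states.append(state)
        outputs.append(rng.choices(range(len(B[state])), weights=B[state])[0])
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
    return states, outputs


if __name__ == "__main__":
    print(is_valid_hmm(A, B, pi))
    print(sample(A, B, pi, 8, random.Random(0)))
```

Only the output sequence would be visible to an observer; the state sequence is the hidden part of the model.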
The outputs of the system can also be continuous. In
that case, for each state we have to model the probability
distribution that generates the outputs of the system in
that state, e.g. a Gaussian distribution. In general, it is possible
to model a different probability distribution function for each
state, but it is common to use the same family of distributions
in all states, with different parameters. This
simplifies the usage of the model and the learning algorithm.
The possible outputs of the system depend on the problem we try
to model using HMMs. The number of states, on the other hand,
is a parameter of the model and thus influences the complexity
of the learning.
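For the Gaussian case mentioned above, the output density of state $i$ can be written out explicitly. This is the standard univariate normal density; the symbol $b_i(x)$ is notation assumed here for illustration, $\mu_i$ and $\sigma_i^2$ denote the mean and variance of state $i$, and the $\pi$ under the square root is the numerical constant, not the vector of priors.

```latex
b_i(x) = \frac{1}{\sqrt{2\pi\sigma_i^{2}}}
         \exp\!\left( -\frac{(x - \mu_i)^{2}}{2\sigma_i^{2}} \right),
\qquad i = 1, \dots, N
```

Using the same family in every state means that only two numbers per state have to be learned for the outputs.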
A. Data preparation and feature selection
Our system recognizes dances using their rhythmic structures.
A rhythmic structure is a sequence of notes of certain
durations, i.e. the alternation of sound and silence in time.
Examples of rhythmic structures that can be recognized by our
classifier are shown in Figure 2. The duration of a note is the
only feature used by our classifier, as it is the only
feature required to describe the rhythmic structure of a dance.
B. Note discretization
When we want to classify musical
pieces with notes represented by their class, we first need
to perform note discretization, i.e. assign each note to a
class of notes: is the note a quaver, a quarter note, a half note,
etc. For discretization of notes we use a modified k Nearest
Neighbours (kNN) classifier, which determines the type of a
note based on its duration and on examples read from the learning
database. This means that the classifier reads the duration of
a note from a MIDI file and then determines whether the
given duration is that of a quaver, a quarter note, a half
note, etc. Every type of note has its own identification number,
or index, which is then used as a feature in the HMM-based
classifier: a semiquaver has index 1, a dotted semiquaver
index 2, a quaver index 3, a dotted quaver
index 4, a quarter note index 5, a dotted quarter note
index 6, a half note index 7, a dotted half note
index 8 and a whole note index 9. These discrete notes
are then used for learning the Hidden Markov Models.
Figure 2. Rhythmic structures of dances recognizable by our system: (a) tango
rhythm, (b) polka rhythm, (c) mazurka rhythm, (d) waltz rhythm, (e) cha-cha-cha
rhythm, (f) march rhythm.
The classifier used for note discretization is not
a usual kNN classifier. It works in the following
way: every note duration that has to be discretized is first
compared with the learning examples by calculating the differences
between the duration of the note and the durations of all notes
in the learning set. Our learning set has 100 examples for each
note type. Next, all examples that attain the minimal absolute
value of this difference (i.e. all examples tied for closest) are
chosen. The note is then classified into the class that is
most frequent among the chosen examples. For example, let us
classify a note that has a duration of 0.9245. We calculate the
differences between that duration and the durations of all notes
in the learning set and take the absolute values of the
differences. Let us assume that the notes that attain the
minimal absolute difference belong to the classes
{6, 6, 6, 6, 5, 7}. Since class 6 is the most frequent in the
set of closest classes, the note is classified into class 6, which
represents the dotted quarter note.
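The discretization rule just described can be sketched in a few lines. This is an illustrative reading of the text rather than the original Matlab code: the function name `discretize`, the `tol` parameter used to decide when two differences count as equal, and the way a tie in the vote is broken (first class encountered) are assumptions made here.

```python
from collections import Counter

# Note-type indices as listed in the text.
NOTE_NAMES = {
    1: "semiquaver", 2: "dotted semiquaver", 3: "quaver",
    4: "dotted quaver", 5: "quarter note", 6: "dotted quarter note",
    7: "half note", 8: "dotted half note", 9: "whole note",
}


def discretize(duration, learning_set, tol=1e-9):
    """Assign `duration` to a note class.

    learning_set: list of (duration, class_index) pairs.
    All learning examples whose absolute difference from `duration` equals
    the minimum (within `tol`, an assumed tolerance) are collected, and the
    most frequent class among them is returned.
    """
    diffs = [(abs(d - duration), label) for d, label in learning_set]
    min_diff = min(diff for diff, _ in diffs)
    closest = [label for diff, label in diffs if diff - min_diff <= tol]
    # Majority vote; Counter keeps first-seen order for equal counts.
    return Counter(closest).most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up learning set whose closest examples have classes 6, 6, 6, 6, 5, 7.
    learning_set = [(98, 6), (98, 6), (102, 6), (102, 6),
                    (98, 5), (102, 7), (90, 5), (120, 7)]
    print(NOTE_NAMES[discretize(100, learning_set)])
```

The made-up learning set reproduces the situation in the example above, where six examples are tied for closest and class 6 wins the vote.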
C. Learning the note classifier
For classifying the notes we used a method based on Hidden
Markov Models, as described at the beginning of this section.
Below we explain the methods for learning the
classifier and then describe how a new example
is classified.
The learning processes for discrete and continuous
note durations are similar. In both cases we use the Maximum
Likelihood criterion. Based on that criterion, we want to
determine the parameters of the Hidden Markov Model in
such a way that the probability that the model generates the
learning examples is maximal. Unfortunately, the
solution of this maximization problem cannot be found in
closed form. Therefore, we need to use iterative methods
for finding the solution. This can be done in various ways,
for example with the Baum-Welch algorithm or with
gradient descent optimization, as explained in [5]. Instead of the
Maximum Likelihood criterion, it is possible to use the Maximum
Mutual Information criterion, for which gradient descent
optimization methods are also required [5].
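The Maximum Likelihood criterion can be written as a formula. The notation is an assumption introduced here for illustration: $O^{(k)}$ denotes the $k$-th learning example (a sequence of note durations), $K$ is the number of learning examples for one dance, and the examples are assumed to be independent, so that their joint probability is a product and its logarithm is a sum.

```latex
\lambda^{*}
  = \arg\max_{\lambda} \prod_{k=1}^{K} P\!\left(O^{(k)} \mid \lambda\right)
  = \arg\max_{\lambda} \sum_{k=1}^{K} \log P\!\left(O^{(k)} \mid \lambda\right)
```

The sum on the right is the log-likelihood that an iterative method such as Baum-Welch increases from one iteration to the next.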
We learn our classifier with the Maximum Likelihood
criterion because the method for iterative maximization of
this criterion is already implemented in a Hidden Markov
Model toolbox for Matlab.¹ The learning algorithm is
stopped if it converges or if it exceeds the maximum allowed
number of iterations, which in our case was 60.
While learning, we record the log-likelihood in each iteration
and show how it grows until it reaches its maximum. The plot
of the growth of the log-likelihood is shown in Figure 3.
We have trained a separate HMM for each dance, i.e. each
HMM generates the rhythmic structure of the dance it represents
with the maximum likelihood. In the case of continuous
note durations, the output probabilities of each state of the
model are represented with a Gaussian distribution with
parameters $\mu_i$ and $\sigma_i^2$, i.e. with the mean and the variance. The
transition probabilities, priors and Gaussian distribution
parameters are determined by the learning algorithm using
training examples. The number of states of each HMM is
determined with 3-fold cross-validation using 60 examples.
We have determined that the HMMs that represent tango, polka,
cha-cha-cha and march should have three states, the HMM
that represents mazurka should have four states and
the HMM that represents waltz should have five states.
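Choosing the number of states by 3-fold cross-validation can be sketched generically as follows. The text does not say which score was compared across folds or how the 60 examples were split, so both are assumptions here: `train` and `score` are hypothetical callables supplied by the caller (for instance Baum-Welch training and a held-out log-likelihood), higher scores are taken to be better, and the folds are simply interleaved.

```python
import math


def k_fold_indices(n_examples, k):
    """Split indices 0..n_examples-1 into k interleaved folds."""
    return [list(range(start, n_examples, k)) for start in range(k)]


def select_num_states(examples, candidates, train, score, k=3):
    """Return the candidate state count with the best total held-out score.

    train(train_set, n_states) -> model   (hypothetical, caller-supplied)
    score(model, example) -> float        (hypothetical, higher is better)
    """
    folds = k_fold_indices(len(examples), k)
    best_n, best_total = None, -math.inf
    for n_states in candidates:
        total = 0.0
        for fold in folds:
            held_out = set(fold)
            train_set = [x for i, x in enumerate(examples) if i not in held_out]
            model = train(train_set, n_states)
            total += sum(score(model, examples[i]) for i in fold)
        if total > best_total:
            best_n, best_total = n_states, total
    return best_n


if __name__ == "__main__":
    # Stub train/score functions, used only to show the call shape.
    stub_train = lambda train_set, n_states: n_states
    stub_score = lambda model, example: -(model - 4) ** 2
    print(select_num_states(list(range(60)), [2, 3, 4, 5, 6], stub_train, stub_score))
```

With 60 examples and three folds, each candidate is trained three times on 40 examples and evaluated on the remaining 20.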
Figure 3. The growth of the log-likelihood in iterations of the learning
algorithm
The interpretation of the parameters $\mu_i$ and $\sigma_i^2$ is straightforward:
they determine the mean value of the note duration and the
variance around the mean. The interpretation of the other parameters,
such as the number of states and the transition probabilities, is not so
intuitive. The prior can be interpreted as the probability that a
note is first in the rhythmic structure. We can interpret
the number of states of an HMM as the number of different notes
in a rhythmic structure. The transition probabilities between
states can represent the probabilities that a certain note will
appear after another note in a rhythmic structure. For example,
in this interpretation $a_{37}$ represents the probability that a half note
will appear after a quaver, using the indices
3 = quaver and 7 = half note.
¹ http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
D. Classification of dances
After learning the HMMs for each dance, the classification
of a new example is simple and intuitive. For each HMM
we calculate the likelihood that the model generates the
given example. We then classify the example into the dance
category for which the calculated likelihood is maximal. We
calculate the likelihood of generating the given example with
the forward algorithm described in [2]. If the likelihoods of
generating the example are the same for all HMMs, the example is
not classified.
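The decision rule above can be sketched for the discrete-output case. This is a textbook scaled forward algorithm written for illustration, not the toolbox routine used by the authors. Outputs are assumed to be zero-based symbol indices, the observation sequence is assumed to be non-empty, and log-likelihoods are compared instead of likelihoods to avoid numerical underflow on long sequences (the maximum is attained by the same model). The model labels and parameter values in the usage part are made up.

```python
import math


def forward_log_likelihood(obs, A, B, pi):
    """Return log P(obs | A, B, pi) using the scaled forward algorithm."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transitions, then weight by the output.
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return -math.inf  # the model cannot generate this sequence
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


def classify(obs, models):
    """Return the label of the most likely model, or None if all scores tie.

    models: dict mapping a label to a tuple (A, B, pi).
    """
    scores = {label: forward_log_likelihood(obs, *params)
              for label, params in models.items()}
    values = list(scores.values())
    if len(values) > 1 and all(math.isclose(v, values[0]) for v in values):
        return None
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Two made-up single-state models over two output symbols.
    models = {
        "dance_a": ([[1.0]], [[0.9, 0.1]], [1.0]),
        "dance_b": ([[1.0]], [[0.2, 0.8]], [1.0]),
    }
    print(classify([0, 0, 1], models))
    print(classify([1, 1, 1], models))
```

Returning `None` when every model gives the same score mirrors the rule that such an example is left unclassified.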
Let us show that on an example. Let the rhythmic structure
we want to classify be given with
X =