
Hidden Markov Model based dance recognition

Dragutin Hrenek, Nenad Mikša, Robert Perica, Pavle Prentašić and Boris Trubić
University of Zagreb, Faculty of Electrical Engineering and Computing
Unska 3, 10000 Zagreb, Croatia
E-mail: {dragutin.hrenek,nenad.miksa,robert.perica,pavle.prentasic,boris.trubic}@fer.hr
Abstract: In this paper we describe a dance classification
system for compositions written in MIDI format. The system
recognizes the following dances: tango, polka, mazurka, waltz,
cha-cha-cha and march. The rhythmic structure of a dance is a
finite sequence of notes of specified durations that repeats itself
throughout the whole composition, so we can hypothesise that the
probability of occurrence of a specified note duration depends on
the duration of the note before it. Hence the classifier is
implemented using Hidden Markov Models. The models
are used in two basic forms: the first assumes discrete note
durations, and the other assumes that note durations follow a
normal distribution. The system was tested using examples
generated from dance prototypes with added Gaussian noise, as well
as with human-played examples. The results obtained with both kinds
of examples are comparable. The system was implemented using
the Matlab programming package.
I. INTRODUCTION
Upon hearing a certain sequence of notes or rhythm, a dance
expert or even a dance enthusiast immediately thinks of some
type of dance or movement which would best fit the heard
music. Thus, he/she easily recognizes the type of dance or
music that is being played. A computer is not able to do the
same with such ease, as it is unable to focus on a specific
musical instrument in the audio recording.
MIDI (Musical Instrument Digital Interface) format is commonly
used in music production, alongside mp3, wave and
similar formats. MIDI is a protocol by which a computer
communicates with certain external devices, such as keyboards.
The protocol is based on the exchange of messages between the
device and the computer. Those messages can be saved in a file
and interpreted later as audio or as musical notation. The
protocol is a standard which is used by all musical instruments
and musical software, but the problem is that most
devices and software do not honor the protocol specifications
exactly. Therefore, it is common that a score
written in one program and saved in MIDI format,
when opened in another program, corresponds poorly to
the source score. The problem resides in the fact that
each note can be written in MIDI format in various ways. This
increases the possibility of misinterpretation of the recording.
One of the most common notes is the quarter note. In a
4/4 measure it is represented by one beat. For example, if
the tempo is 120 and the measure is 4/4, then there should be 120
quarter notes in one minute of recording. Thus, every quarter
note should last exactly half a second. But this holds only on
average. Let us assume that each quarter note lasts 100 ticks
of the clock. Then each quaver should last 50 ticks and a dotted
quaver should last 75 ticks. The dot increases the duration of
a note by 50%. For example, a dotted quarter note lasts as long
as a quarter note and a quaver together. If the music
is played by a human, then the quarter note lasts 100 ticks only
on average; it can last a bit more or less, e.g. 102 ticks
or 85 ticks. This depends on the melody phrasing and other
factors. Music notation software often adds noise to the
note durations when saving in MIDI format in order
to make the recording sound as if it had
been played by a human. This makes the correct interpretation
of a note difficult for the computer: for example, a note
that lasts 85 ticks is closer to a dotted quaver (75 ticks) than to a
quarter note (100 ticks).
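The arithmetic above can be checked with a short sketch. It assumes, as in the example, 100 ticks per quarter note and a tempo of 120 quarter notes per minute; the names used are illustrative only and do not come from the described system.

```python
# Assumptions taken from the example in the text.
TICKS_PER_QUARTER = 100
TEMPO_BPM = 120  # quarter notes per minute

# Nominal lengths in ticks; a dot adds 50% to the plain note.
NOMINAL_TICKS = {
    "quaver": TICKS_PER_QUARTER * 0.5,
    "dotted quaver": TICKS_PER_QUARTER * 0.5 * 1.5,
    "quarter": TICKS_PER_QUARTER * 1.0,
    "dotted quarter": TICKS_PER_QUARTER * 1.5,
}


def seconds_per_quarter(tempo_bpm):
    """Duration of one quarter note in seconds at the given tempo."""
    return 60.0 / tempo_bpm


def nearest_note(ticks):
    """Name of the note whose nominal length is closest to `ticks`."""
    return min(NOMINAL_TICKS, key=lambda name: abs(NOMINAL_TICKS[name] - ticks))


print(seconds_per_quarter(TEMPO_BPM))  # 0.5
print(nearest_note(102))               # quarter
print(nearest_note(85))                # dotted quaver
```

A naive nearest-length rule therefore reads the 85-tick quarter note as a dotted quaver, which is exactly the kind of misinterpretation described above.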
Thus, the quarter note is often not a real quarter note.
This is the reason for wrong interpretation of notes among
different programs. Because of this problem, classification of
dances by rhythmic patterns obtained from MIDI files is a
very challenging problem in computer science. In this paper
we present methods that enable the computer to recognize
dances based on human-labeled examples.
The next section gives an overview of previous work
on the described problem. The third section
describes a method for classification of musical pieces with
the Hidden Markov Model. The fourth section describes the
results of classification. The fifth section concludes the paper
and discusses aspects of future work.
II. PREVIOUS WORK
The described problem is closely related to the problem of
detecting the rhythmic structure of a musical piece. Takeda
et al. define the problem as a search for a sequence of states
in a probabilistic model [1]. Since the states are represented
with Hidden Markov Models, the most probable sequence of
states can be found with the well-known Viterbi algorithm [2].
Therefore, the rhythmic structure is determined by the most
probable sequence of states found by the Viterbi algorithm for
the given sequence of observations. This method is good for
finding a specific rhythmic structure, but it is impractical for
classification of rhythmic structures.
In [3], a system for extraction of musical features from
MIDI recordings is described. The described system consists
of several subsystems carrying out the following tasks:
identifying basic musical objects (notes, pauses, chords, etc.),
searching for the accent on each musical object, rhythm
recognition, rhythm tracking and note discretization. The rhythmic
structure of the piece is recognized by looking into a time
interval which consists of a certain number of notes. This time
interval is determined in advance for each potential rhythmic
structure that is being recognized. The actual notes in that
interval are then compared to the expected notes and then the
classification is performed. This method is not practical for
solving our problem as it does not give good results.
In [4], methods for note duration discretization and methods
for detection and tracking of rhythm are presented. The rhythm
detection in that paper is based on Hidden Markov Models in
such a way that each state of the model represents the moment
in which a note has been played. This enables the modeling
of different moments in which a note can appear. This method
is very useful for converting MIDI recordings into printable
musical notation.
III. METHOD DESCRIPTION
It is a general trend to use Hidden Markov Models (HMMs)
for solving pattern recognition problems in cases where the
patterns are time-dependent signals, as for example in speech
recognition [2]. MIDI signals are time-dependent signals and
they are a more abstract way of representing music in a
computer. It is much easier to extract note characteristics from
a MIDI recording than from mp3 or wave. Hence, we think that
it is a good idea to use HMMs for classification of
musical pieces recorded in MIDI format.
The idea behind HMMs assumes the existence of a set
of states $Q = \{q_i\}_{i=1}^{N}$, where $N$ is the number of states.
For each state we define the probabilities of transition from the
current state into all other states and the probability of staying
in the current state. Furthermore, for each state we define its a
priori probability (prior), i.e. the probability that the system
will start in this state. Besides the set of states, there exists a
set of possible outputs of the system $V = \{v_j\}_{j=1}^{M}$, where $M$
is the number of possible outputs. For each state of the system, we
define the probability that the system will generate a certain
output while being in that state. All this can be formally
written in the following way: a Hidden Markov Model is a
tuple
$$\lambda = (A, B, \pi)$$
where $A$ is a transition probability matrix, $B$ is an output
probability matrix and $\pi$ is a vector of priors. The elements of
the matrix $A$ are $a_{ij}$ and represent the probability of transition
from state $i$ to state $j$, i.e.
$$a_{ij} = P(q_{t+1} = j \mid q_t = i)$$
The elements of the matrix $B$ are $b_{ij}$ and represent the
probability that the output $j$ will be generated while the system
is in state $i$, i.e.
$$b_{ij} = P(\text{output} = v_j \mid q_t = i)$$
The elements of the vector $\pi$ are $\pi_i$ and represent the
probability that the system will start its work in the state $i$,
i.e.
$$\pi_i = P(q_1 = i)$$
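To make the definition concrete, the following minimal sketch stores a toy model as plain Python lists and draws a short output sequence from it. The numbers, the sizes (two states, three outputs) and the helper names are invented for illustration and are not taken from the trained dance models.

```python
import random

# Toy model (made-up values): N = 2 states, M = 3 output symbols.
A = [[0.7, 0.3],
     [0.4, 0.6]]          # A[i][j] = P(q_{t+1} = j | q_t = i)
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]     # B[i][j] = P(output = v_j | q_t = i)
pi = [0.6, 0.4]           # pi[i]   = P(q_1 = i)


def is_distribution(p, tol=1e-9):
    """True if p is non-negative and sums to one."""
    return all(x >= 0 for x in p) and abs(sum(p) - 1.0) <= tol


def is_valid_hmm(A, B, pi):
    """Check that pi and every row of A and B are probability distributions."""
    return is_distribution(pi) and all(is_distribution(row) for row in A + B)


def sample(A, B, pi, length, rng):
    """Draw a state sequence and an output sequence of the given length."""
    states, outputs = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(length):
        states.append(state)
        outputs.append(rng.choices(range(len(B[state])), weights=B[state])[0])
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
    return states, outputs


rng = random.Random(0)
print(is_valid_hmm(A, B, pi))
print(sample(A, B, pi, 5, rng))
```

Only the output sequence would be visible to an observer; the state sequence is the "hidden" part of the model.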
As a result of such a definition of an HMM, it is suitable to
represent it in the form of a directed graph. Vertices of the graph
represent the states of the HMM and the outputs of the system,
while the edges represent possible transitions between states
and possible outputs of the system for each state. The weights
of the edges represent probabilities. An example of an HMM
is shown in figure 1.
Figure 1. An example of a Hidden Markov Model. X represents the states,
Y represents the possible outputs of the system, a represents the transition
probabilities and b represents the probabilities of outputs in each state.
Possible outputs of the system can be continuous too. In
that case, for each state we have to model the probability
distribution which will generate the outputs of the system in
that state, e.g. a Gaussian distribution. In general, it is possible
to model a different probability distribution function for each
state, but it is common to use the same probability distribution
function in all states, with different parameters. This
simplifies the usage of the model and the learning algorithm.
Possible outputs of the system depend on the problem we try
to model using HMMs. On the other hand, the number of states
is a parameter of the model and thus influences the complexity
of the learning.
A. Data preparation and feature selection
Our system recognizes dances using their rhythmic structures.
A rhythmic structure is a sequence of notes of certain
duration, i.e. the alternation of sound and silence in time.
Examples of rhythmic structures that can be recognized by our
classifier are shown in figure 2. The duration of a note is the
only feature that is used by our classifier, as it is the only
feature required to describe the rhythmic structure of a dance.
B. Note discretization
In cases when we want to test the classification of musical
pieces with notes represented by their class, we first need
to perform note discretization, i.e. classify each note into some
class of notes by deciding whether it is a quaver, a quarter note,
a half note, etc. For discretization of notes we use a modified k Nearest
Neighbours (kNN) classifier, which determines the type of a
note based on its duration and on examples read from the learning
database. This means that the classifier reads the duration of
a note from a MIDI file and then determines whether the
given duration is that of a quaver, a quarter note, a half
note, etc. Every type of note has its own identification number
or index, which is then used as a feature in the HMM-based
classifier. Thus a semiquaver has index 1, a dotted semiquaver
index 2, a quaver index 3, a dotted quaver
index 4, a quarter note index 5, a dotted quarter note
index 6, a half note index 7, a dotted half note
index 8 and a whole note index 9. Such discrete notes
are then used for learning the Hidden Markov Models.
Figure 2. Rhythmic structures of dances recognizable by our system:
(a) tango rhythm, (b) polka rhythm, (c) mazurka rhythm, (d) waltz rhythm,
(e) cha-cha-cha rhythm, (f) march rhythm.
The classifier that is used for note discretization is not
a usual kNN classifier. It works in the following
way: every note duration that has to be made discrete is first
compared with the learning examples, i.e. the differences between
the duration of the note and the durations of all notes in the learning
set are calculated. Our learning set has 100 examples for each
note type. Next, all examples for which the absolute value of
this difference is minimal (and therefore mutually equal) are
chosen. After that, the note is classified into the class that is
most frequent among the chosen notes. For example, let us
classify a note that has a duration of 0.9245. We calculate the
differences between that duration and the durations of all notes in the
learning set. Then we observe the absolute values of the calculated
differences. Let us assume that the notes that correspond to the
minimal absolute difference belong to the classes
{6, 6, 6, 6, 5, 7}. Since class 6 is the most frequent in the
set of closest classes, the note is classified into class 6, which
represents the dotted quarter note.
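The procedure described above can be sketched as follows. This is an illustration rather than the authors' Matlab code: the function name, the tie tolerance `tol` and the tiny learning set are assumptions, and the learning set is constructed only so that the closest examples carry the classes {6, 6, 6, 6, 5, 7} from the example.

```python
from collections import Counter


def discretize(duration, learning_set, tol=1e-9):
    """Classify a duration by majority vote among the closest learning examples.

    learning_set: list of (duration, class_index) pairs.
    tol: differences within `tol` of the minimum are treated as ties
         (an assumption; the text only says "minimal and mutually equal").
    """
    diffs = [(abs(duration - d), label) for d, label in learning_set]
    best = min(diff for diff, _ in diffs)
    closest = [label for diff, label in diffs if diff - best <= tol]
    # The most frequent class among the closest examples wins.
    return Counter(closest).most_common(1)[0][0]


# Made-up learning set: six examples tie for the smallest difference.
learning_set = (
    [(0.92, 6)] * 4 + [(0.92, 5), (0.92, 7)]   # closest to 0.9245
    + [(0.60, 5), (0.62, 5), (1.20, 7), (1.25, 7)]
)

print(discretize(0.9245, learning_set))  # 6
```

Unlike ordinary kNN, the number of voting neighbours is not fixed in advance; it is however many examples happen to share the minimal distance.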
C. Learning the note classifier
For classifying the notes we used a Hidden Markov Model
based method, as described at the beginning of this section. In
what follows we first explain the methods for learning the
classifier and then describe the method of classification
of a new example.
The learning processes in the cases of discrete and continuous
note durations are similar. In both cases we use the Maximum
Likelihood criterion. Based on that criterion, we want to
determine the parameters of the Hidden Markov Model in
such a way that the probability of the model generating the learning
examples is maximal. Unfortunately, the
solution of this maximization problem cannot be found in
closed form. Therefore, we need to use iterative methods
for finding the solution. This can be done in various ways,
for example with the Baum-Welch algorithm or with
gradient descent optimization, as explained in [5]. Instead of the
Maximum Likelihood criterion, it is possible to use the Maximum
Mutual Information criterion, for which gradient descent
optimization methods are also required [5].
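Written out, and assuming that the learning set for one dance consists of $K$ independent sequences $X^{(1)}, \dots, X^{(K)}$ (the symbol $K$ and the superscript notation are introduced here only for illustration), the Maximum Likelihood criterion can be stated as:

```latex
\lambda^{*} = \arg\max_{\lambda} \sum_{k=1}^{K} \log P\left(X^{(k)} \mid \lambda\right),
\qquad \lambda = (A, B, \pi)
```

The sum of logarithms is the log-likelihood whose value the iterative methods increase from one iteration to the next.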
We learn our classifier with the Maximum Likelihood
criterion because the method for iterative maximization of
this criterion is already implemented in a Hidden Markov
Model toolbox for Matlab software¹. The learning algorithm is
stopped if it converges or if it exceeds the maximum allowed
number of iterations, which, in our case, was 60 iterations.
While learning, we record the log-likelihood in each iteration
and show how it grows until it reaches its maximum. The plot
of the growth of the log-likelihood is shown in figure 3.
We have trained a separate HMM for each dance, i.e. each
HMM generates the rhythmic structure of the dance it represents
with the maximum likelihood. In the case of continuous
note durations, the output probabilities of each state of the
model are represented with a Gaussian distribution with
parameters $\mu_i$ and $\sigma_i^2$, i.e. with the mean and the variance. The
transition probabilities, priors and the Gaussian distribution
parameters are determined with the learning algorithm using
training examples. The number of states of each HMM is
determined with 3-fold cross-validation using 60 examples.
We have determined that the HMMs that represent tango, polka,
cha-cha-cha and march should have three states, the HMM
that represents mazurka should have four states and
the HMM that represents waltz should have five states.
The interpretation of the parameters $\mu_i$ and $\sigma_i^2$ is obvious.
They determine the mean value of the note duration and the
variance around the mean. The interpretation of the other parameters,
such as the number of states and the transition probabilities, is not so
intuitive. The prior can be interpreted as the probability that a note
is the first in the rhythmic structure. We can interpret
the number of states of an HMM as the number of different notes
in a rhythmic structure. The transition probabilities between
states can represent the probabilities that a certain note will
appear after another note in a rhythmic structure. For example,
in this interpretation $a_{37}$ represents the probability that a half note
will appear after a quaver. In this example, we used the indices
3 = quaver and 7 = half note.
Figure 3. The growth of the log-likelihood in iterations of the learning
algorithm
¹ http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
D. Classification of dances
After learning the HMMs for each dance, the classification
of a new example is simple and intuitive. For each HMM
we calculate the likelihood that the model will generate the
given example. We then classify the example into the dance
category for which the calculated likelihood is maximal. We
calculate the likelihood of generating the given example with
the forward algorithm described in [2]. If the likelihoods of
generating the example are the same for all HMMs, the example
is not classified.
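The decision rule can be sketched in a few lines of Python. The forward recursion follows the standard formulation in [2] with one Gaussian output density per state; the two models below are hypothetical toy models, not the trained dance models, and the function names are chosen for this illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def forward_likelihood(obs, A, pi, means, variances):
    """Likelihood of the observation sequence under one HMM (forward algorithm)."""
    n = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * gaussian_pdf(obs[0], means[i], variances[i]) for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for x in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n))
            * gaussian_pdf(x, means[j], variances[j])
            for j in range(n)
        ]
    # Termination: sum over all states at the last time step
    return sum(alpha)


def classify(obs, models):
    """Return (label, scores); label is None when all likelihoods are equal."""
    scores = {name: forward_likelihood(obs, **p) for name, p in models.items()}
    if len(scores) > 1 and len(set(scores.values())) == 1:
        return None, scores
    return max(scores, key=scores.get), scores


# Hypothetical two-state models for two made-up dances.
models = {
    "dance_a": dict(A=[[0.1, 0.9], [0.9, 0.1]], pi=[1.0, 0.0],
                    means=[1.0, 0.5], variances=[0.01, 0.01]),
    "dance_b": dict(A=[[0.5, 0.5], [0.5, 0.5]], pi=[0.5, 0.5],
                    means=[0.25, 0.25], variances=[0.01, 0.01]),
}

label, scores = classify([0.98, 0.52, 1.01], models)
print(label)  # dance_a
```

For long sequences the products in the recursion underflow, so practical implementations rescale the forward variables at every step or work with log-likelihoods; the sketch omits this for brevity.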
Let us illustrate this with an example. Let the rhythmic structure
we want to classify be given by
$$X = \begin{bmatrix} 0.9245 & 0.9440 & 0.6120 \end{bmatrix}$$
The likelihoods of the HMMs for the given example are as
follows:
$$
\begin{aligned}
P(X \mid \text{tango}) &= 0.0909 \\
P(X \mid \text{polka}) &= 8.9747 \times 10^{-22} \\
P(X \mid \text{mazurka}) &= 2.3991 \times 10^{-12} \\
P(X \mid \text{waltz}) &= 0.0199 \\
P(X \mid \text{cha-cha-cha}) &= 6.4525 \times 10^{-9} \\
P(X \mid \text{march}) &= 4.9277 \times 10^{-17}
\end{aligned}
$$
We classify the given example as tango, since the likelihood
of the example is maximal under the tango model.
In the case of discrete note durations, we first have to discretize
the example and then classify it. The classification procedure
is the same as in the case of continuous note durations, with the
exception that the output symbols of each state of the HMM are
discrete-valued, so we do not assume a theoretical distribution
that would generate the output examples.
IV. RESULTS
A. Data set
Our system discriminates six different dances: tango, polka,
mazurka, waltz, cha-cha-cha and march, but it is easy to
add more dance types. Based on the rhythmic structures that are
available on Wikipedia and that are shown in figure 2, we
generated the learning examples in two ways: synthetically
from the prototypes and by playing the rhythms on a keyboard
with a MIDI interface.
The synthetic generation of examples was done in the
following way: we assumed that the duration of the quarter
note is 120 ticks. Based on that assumption we calculated the
durations of the other notes and generated the prototypes of the
rhythmic structures of dances according to the rhythms displayed
in figure 2. We then added Gaussian noise to the prototypes
in order to get synthetic examples. The mean and variance of
the Gaussian noise were randomly changed in order to get
examples that are as heterogeneous as possible.
Besides the synthetically generated examples, we played the
rhythms displayed in figure 2 on a keyboard with a MIDI
interface, which can be used to load the played notes into the
computer. For each dance we played 70 examples that were
used exclusively for learning and validation, whilst for
cross-validation we used an additional 30 examples. As these examples were
really played, they represent a real situation where the note
durations may not obey the Gaussian distribution that was
assumed while synthetically generating the examples. This will
also show whether the assumption that the note durations
obey the Gaussian distribution was correct.
B. Classication results
The classifier was tested in various ways. First we used
the synthetically generated examples with continuous note
durations. We generated 50 examples for learning and 100
examples for testing in the way described above.
The parameters of the additive Gaussian noise were the
following: for each example, the variance was randomly selected
from the interval [0, 2] and the mean from the interval [-5, 5]. This
means that for each example we first generated the parameters
of the additive Gaussian noise, then generated the noise
and finally added the noise to the prototypes of the dance
rhythmic structures.
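The generation of one synthetic example can be sketched as below. Several points are assumptions made for this illustration: the noise parameters are drawn uniformly from their intervals, the mean interval is read as symmetric around zero ([-5, 5]), and the prototype is a hypothetical rhythm rather than one of the rhythms in figure 2.

```python
import random

QUARTER_TICKS = 120  # duration of a quarter note assumed in the text


def make_example(prototype, rng, var_range=(0.0, 2.0), mean_range=(-5.0, 5.0)):
    """Add Gaussian noise with a randomly drawn mean and variance to a prototype.

    One (mean, variance) pair is drawn per example, then independent noise
    with those parameters is added to every note duration.
    """
    mean = rng.uniform(*mean_range)
    std = rng.uniform(*var_range) ** 0.5
    return [d + rng.gauss(mean, std) for d in prototype]


# Hypothetical prototype: quarter, quarter, two quavers (in ticks).
prototype = [QUARTER_TICKS, QUARTER_TICKS, QUARTER_TICKS // 2, QUARTER_TICKS // 2]

rng = random.Random(42)
examples = [make_example(prototype, rng) for _ in range(3)]
for example in examples:
    print([round(d, 1) for d in example])
```

Because the mean of the noise is shared by all notes of an example, each example is effectively played slightly faster or slower as a whole, with a small independent jitter on top.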
The results of the first experiment are given in table I. The
rows of the table represent the dance which was the decision
of the classifier, and the columns represent the real dance type.
Table I
CONFUSION MATRIX FOR CLASSIFIER WHICH USES SYNTHETICALLY GENERATED EXAMPLES AND CONTINUOUS-VALUED DURATION OF NOTES
Dance        Tango  Polka  Mazurka  Waltz  Cha-cha-cha  March
Tango          100      0        0      0            0      0
Polka            0      1        0      0            0      0
Mazurka          0      0      100      0            0      0
Waltz            0      0        0    100           21      0
Cha-cha-cha      0      0        0      0           79      0
March            0     99        0      0            0    100
The accuracy of the classification is 93.33%, and precision and
recall are 80%. The $F_1^{\text{micro}}$ and $F_1^{\text{macro}}$ values are as follows:
$$F_1^{\text{micro}} = 80\%$$
$$F_1^{\text{macro}} = 74.61\%$$
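The paper does not spell out how these measures are computed. The sketch below assumes one-vs-rest accuracy averaged over the six classes, pooled (micro-averaged) precision and recall, and a per-class averaged (macro) F1; under these assumptions it reproduces the values quoted for table I.

```python
def metrics(cm):
    """Summary measures for a confusion matrix.

    cm[p][t] = number of examples of true class t assigned to class p
    (rows are classifier decisions, columns are true classes).
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    tp = [cm[c][c] for c in range(k)]
    # Assigned to c but truly another class.
    fp = [sum(cm[c]) - cm[c][c] for c in range(k)]
    # Truly c but assigned to another class.
    fn = [sum(cm[p][c] for p in range(k)) - cm[c][c] for c in range(k)]

    # One-vs-rest accuracy (TP + TN) / total, averaged over classes.
    accuracy = sum((total - fp[c] - fn[c]) / total for c in range(k)) / k
    micro_p = sum(tp) / (sum(tp) + sum(fp))
    micro_r = sum(tp) / (sum(tp) + sum(fn))
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
    per_class_f1 = [
        2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) if tp[c] + fp[c] + fn[c] else 0.0
        for c in range(k)
    ]
    macro_f1 = sum(per_class_f1) / k
    return accuracy, micro_p, micro_r, micro_f1, macro_f1


# Table I: order of classes is tango, polka, mazurka, waltz, cha-cha-cha, march.
TABLE_I = [
    [100, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 100, 0, 0, 0],
    [0, 0, 0, 100, 21, 0],
    [0, 0, 0, 0, 79, 0],
    [0, 99, 0, 0, 0, 100],
]

print(" ".join(f"{100 * value:.2f}%" for value in metrics(TABLE_I)))
```

With single-label decisions and no refused examples, micro-averaged precision, recall and F1 all coincide with the fraction of correctly classified examples (480 of 600 here), which is why the three values are identical.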
We can notice that the classifier is bad at discriminating
polka from march. The great similarity between those dances is
the main reason for such behaviour: the rhythmic
structures of those dances in figure 2 show very similar
quaver and semiquaver patterns. Semiquavers and quavers are
very short notes, so it is very difficult to discriminate them
in this case, especially if the examples have a large variance and
large deviations of the note duration means. The poorer discrimination of
waltz from cha-cha-cha is a consequence of noise in the examples.
We can see that the classifier is much more accurate and
precise than randomly picking a dance. Namely, the
accuracy of a random pick would, on average, be equal to
the probability of randomly picking the correct dance. As the
system discriminates six dances, the accuracy of the random
choice would be $\frac{1}{6} \approx 16.67\%$.
If we make the examples from the previous experiment discrete,
as described earlier, and then use our classifier in
the discrete domain, the results become even better (see
table II). The classification accuracy has increased to 95.78%,
and precision and recall to 87.33%. The $F_1^{\text{micro}}$ and $F_1^{\text{macro}}$
values are as follows:
$$F_1^{\text{micro}} = 87.33\%$$
$$F_1^{\text{macro}} = 85.57\%$$
We can now notice that the classifier discriminates
polka from march better, but most of the polka examples
are still misclassified, which can be explained by the
great similarity of the theoretical rhythmic structures of polka and
march.
In a realistic situation, few musical pieces have the exact
theoretical rhythmic structure. Therefore, we tested our classifier
with the examples we played ourselves on the keyboard
with the MIDI interface. In the following experiments, we used
25 examples of each dance for learning the classifier and 45
examples for testing the classification. We first carried out an
experiment with continuous-valued note durations. The results
of the experiment are shown in table III. The classification
accuracy is 96.91%, and the precision and recall are 90.74%.
The $F_1^{\text{micro}}$ and $F_1^{\text{macro}}$ values are as follows:
$$F_1^{\text{micro}} = 90.74\%$$
$$F_1^{\text{macro}} = 89.97\%$$
We can now notice that the classifier discriminates
waltz from cha-cha-cha better, which is the consequence of better
learning and testing examples. If we use 30 examples for
learning, instead of 25, the precision and accuracy of classification
rise to 100%.
When carrying out the last experiment, we used the same
examples as in the previous experiment, but before classification
we discretized them and used the classifier in the discrete
domain. The results of the experiment are shown in table IV.
The classification accuracy has risen to 99.32%, and precision
to 98.50%. The $F_1^{\text{micro}}$ and $F_1^{\text{macro}}$ values are as follows:
$$F_1^{\text{micro}} = 97.95\%$$
$$F_1^{\text{macro}} = 97.95\%$$
The recall is 97.41% because not all examples have been
classified: the classifier refused to classify one example
of tango and two examples of waltz, so those examples were
counted as false negatives when calculating the F values and
recall. This lowered the recall of the classification.
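These values can be traced back to the counts in table IV: 270 test examples in total, 267 of which received a decision, and 263 of those decisions were correct. Treating the three refused examples as false negatives, as described above, gives:

```latex
\text{precision} = \frac{263}{267} \approx 98.50\%, \qquad
\text{recall} = \frac{263}{270} \approx 97.41\%, \qquad
F_1^{\text{micro}} = \frac{2 \cdot 263}{267 + 270} = \frac{526}{537} \approx 97.95\%
```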
We can notice that in the last experiment the discrimination rate
between polka and march has risen considerably. This tells us that this
type of classification is the best for general use. The refusal of
classification is often regarded as better than misclassification,
as it enables people to manually classify the examples that
the classifier refused to classify.
Table II
CONFUSION MATRIX FOR CLASSIFIER WHICH USES SYNTHETICALLY GENERATED EXAMPLES AND DISCRETE-VALUED NOTE DURATIONS
(rows: decision of the classifier, columns: real dance type)
Dance        Tango  Polka  Mazurka  Waltz  Cha-cha-cha  March
Tango          100      0        0      0            0      0
Polka            0     27        0      0            0      3
Mazurka          0      0      100      0            0      0
Waltz            0      0        0    100           21      0
Cha-cha-cha      0      0        0      0           79      0
March            0     73        0      0            0     97
Table III
CONFUSION MATRIX FOR CLASSIFIER WHICH USES HUMAN-PLAYED EXAMPLES AND CONTINUOUS-VALUED NOTE DURATIONS
(rows: decision of the classifier, columns: real dance type)
Dance        Tango  Polka  Mazurka  Waltz  Cha-cha-cha  March
Tango           45      0        0      0            0      0
Polka            0     20        0      0            0      0
Mazurka          0      0       45      0            0      0
Waltz            0      0        0     45            0      0
Cha-cha-cha      0      0        0      0           45      0
March            0     25        0      0            0     45
Table IV
CONFUSION MATRIX FOR CLASSIFIER WHICH USES HUMAN-PLAYED EXAMPLES AND DISCRETE-VALUED NOTE DURATIONS
(rows: decision of the classifier, columns: real dance type)
Dance        Tango  Polka  Mazurka  Waltz  Cha-cha-cha  March
Tango           44      0        0      0            0      0
Polka            0     45        0      0            0      4
Mazurka          0      0       45      0            0      0
Waltz            0      0        0     43            0      0
Cha-cha-cha      0      0        0      0           45      0
March            0      0        0      0            0     41
V. CONCLUSION
In this paper we described a system which is able to
recognize dances based on MIDI recordings of the pieces.
This enables enthusiasts who do not understand musical
notation to recognize their favourite dances. This is an
interesting problem because MIDI recordings are usually played
by humans, so it is not possible to determine the type of a
note or the rhythmic structure with full certainty.
We made a classifier which can classify examples with
both discrete- and continuous-valued note durations. The
classifier is based on Hidden Markov Models. The discretization
of notes was performed with a modified kNN classifier. The
classifier was trained with synthetically generated and
human-played examples, with both continuous- and discrete-valued
note durations. The best classification rates were achieved with
human-played examples with discrete-valued note durations.
We have accomplished everything planned, although future
expansions of the system are possible. For example, it would
be possible to create a classifier which would automatically
find the characteristic rhythmic structure of a piece during the
training phase. With a greater number of examples, such a
classifier would be better at learning specific music pieces,
in contrast to our work, where we used theoretical rhythmic
structures.
REFERENCES
[1] H. Takeda, N. Saito, T. Otsuki, M. Nakai, H. Shimodaira, and
S. Sagayama, "Hidden Markov model for automatic transcription of MIDI
signals," in Multimedia Signal Processing, 2002 IEEE Workshop on.
IEEE, 2003, pp. 428-431.
[2] L. Rabiner, "A tutorial on hidden Markov models and selected
applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp.
257-286, 1989.
[3] E. Cambouropoulos, "From MIDI to traditional musical notation," in
Proceedings of the AAAI Workshop on Artificial Intelligence and Music:
Towards Formal Models for Composition, Performance and Analysis,
vol. 30, 2000.
[4] M. Hamanaka, M. Goto, H. Asoh, and N. Otsu, "A learning-based
quantization: Estimation of onset times in a musical score," in Proceedings of
the 5th World Multi-conference on Systemics, Cybernetics and Informatics
(SCI 2001), vol. 10, 2001, pp. 374-379.
[5] N. Warakagoda, A Hybrid ANN-HMM ASR System with NN-based
Adaptive Preprocessing. Institutt for teleteknikk, NTH, 1994.
