Sponsor: Joy Mustafi, MUST Research Club | ISB Mentor: Peeyush Taori
Capstone Project – Music Genre Classification
1 Abbreviations
2 Motivation
3 Project Description
3.1 Music Genre Classification Problem
3.2 Data Set
3.2.1 GTZAN Dataset
3.2.2 Limitations of GTZAN Dataset
3.2.3 Million Song Dataset
3.2.4 Limitations of Million Song Dataset
4 Classification Methodology
4.1 Data Setup
4.2 Features Extracted from GTZAN Data Set
4.2.1 Mel-Frequency Cepstral Coefficients (MFCC)
4.2.2 Mel Spectrogram
4.2.3 RMSE
4.2.4 Chromagram Features
4.2.5 Spectral Centroid
4.2.6 Spectral Contrast
4.2.7 Tonal Centroid
4.3 Features Selected from Million Song Dataset
4.3.1 Segment Pitches
4.3.2 Segment Timbre
4.3.3 Loudness
4.3.4 Tempo
4.4 Dimensionality Reduction
4.4.1 Linear Discriminant Analysis
5 Machine Learning Methods Tested
5.1 Classification using Multi-Layer Perceptron
5.2 Classification using Support Vector Machines (SVM)
5.3 Classification using Gaussian Naïve Bayes
Music information retrieval has assumed great significance in the recent past owing to its wide business applications. These include recommender systems, track separation and instrument recognition, automatic music transcription, automatic categorization / genre classification, music generation, etc. [1]
Technology Dynamics Driving Music Information Retrieval
Consumption of music online via streaming has gained popularity as downloading and storing music files has become easier, and large collections of albums are available on the cloud either as a free or a paid service. One of the key elements of music data management is to identify the genre a particular audio file belongs to, so that large quantities of files can be stored grouped by genre for easier management. These days, online radio stations play songs to a particular user based on genre preference. Many online music streaming services recommend a specific song or audio clip to a given user based on their browsing or search history, and these services even offer "smart playlists" built from the music a user has played or their stated preferences. With such diverse applications and a large volume of music data in use, music database management is inevitable, and it is becoming a big data problem.
Music genre classification is an ambiguous and subjective task. It is also a contested area of research, either because of low classification accuracy or because some argue that genre boundaries cannot be drawn objectively in the first place.
We reviewed previously published efforts on music genre classification and found that model accuracies did not exceed approximately 84% [3] when the model was trained on a set of low-level features. In this project, we present a novel approach to music genre classification using low-level audio features extracted directly from the raw audio file that improves model accuracy beyond 85%. A set of low-level features was identified that has proven effective in separating the genres. On these features, we applied linear discriminant analysis (LDA) to identify the factors that most effectively discriminate between the classes. These factors were in turn used as input to various supervised machine learning models, and a comparative study was done to identify the best model. Our experiments confirmed that a high level of classification accuracy can be achieved through a combination of feature selection, dimensionality reduction, and supervised machine learning techniques.
Music genre classification can be achieved by learning the characteristics of collections of songs whose genres are already known; this is the supervised machine learning approach. Another approach is unsupervised learning, in which unlabeled songs are analyzed: by examining their characteristics, the algorithm attempts to build clusters of songs based on similarity. This project aims to build a machine learning model using supervised learning methods and use that model to predict the genre of a given musical clip.
This dataset was used in the well-known genre classification paper "Musical genre classification of audio signals" by G. Tzanetakis et al. [6]
The GTZAN dataset does not ship with pre-extracted features, and hence features are extracted from the raw audio files as explained in the next section.
The MSD does not supply an associated genre tag. Genres for 191,000 tracks of the MSD are provided by Tagtraum Industries [8]. The tagtraum genre annotations are based on multiple source datasets and allow for ambiguity; details can be found in [9]. Three ground truths were generated, based on the Last.fm dataset, the Top-MAGD dataset, and the beaTunes Genre Dataset (BGD). (beaTunes is an advanced music application for Windows and OS X that lets you analyze, inspect, and play songs, and create playlists.)
• GTZAN dataset: This data set has 1000 audio clips from ten different genres.
• Million Song Dataset: Although the MSD contains a million audio tracks, genres were defined for only about 190,000 of them. We used these 190,000 tracks as the input data set for classification.
The classification models are built separately on each data set, and the accuracies are discussed separately in the coming sections of this report. The primary reason for this is that, for the GTZAN data set, we could extract several low-level audio features used for processing and classification, whereas for the Million Song Dataset the source audio clips were not available, so we had to depend on the features supplied by The Echo Nest.
Each data set is divided into train and test subsets of 70% and 30% respectively; a minimal sketch of this split is shown below.
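As an illustration, the split can be done with scikit-learn as follows. The feature matrix, labels, stratification, and random seed below are placeholder assumptions for the sketch, not the project's exact settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and genre labels; in the project these come
# from the audio features described in the following sections.
X = np.random.rand(1000, 30)        # 1000 clips, 30 features (illustrative)
y = np.random.randint(0, 10, 1000)  # 10 genre labels

# 70% train / 30% test; stratification is an assumption, used here to keep
# genre proportions balanced across the two subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
```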
In the following sections, the various features used for building the classification models are explained.
The filter bank is what makes MFCCs unique. It is constructed using 13 linearly spaced filters and 27 log-spaced filters, following a common model of human auditory perception. The distance between the center frequencies of the linearly spaced filters is 133.33 Hz; the log-spaced filters are separated by a factor of 1.071 in frequency. The final cosine transform is applied to reduce the dimensionality of the output, typically to the 12 most important coefficients. Additionally, the power of the signal in each frame is calculated, resulting in a feature vector of d = 13.
MFCCs are commonly used in speech recognition systems and seem to capture the perceptually relevant parts of the spectrum better than other techniques. They have been applied successfully to content-based retrieval of audio samples and have also been used in music genre recognition systems.
The MFCC plot is harder to interpret visually than the spectrogram, but has been found to yield better
results in computer sound analysis.
Sample plot of MFCC for a rock music file from GTZAN dataset
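For illustration, MFCCs of the kind described above can be extracted with librosa. The file path below is a placeholder for any GTZAN clip, and the parameters other than n_mfcc are library defaults rather than the project's exact settings.

```python
import librosa

# Load one 30-second clip; the path is a placeholder for any GTZAN file.
y, sr = librosa.load("genres/rock/rock.00000.wav", sr=22050)

# 13 coefficients per frame, matching the d = 13 feature vector above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```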
4.2.3 RMSE
The energy [10] of a signal corresponds to the total magnitude of the signal; for audio signals, that roughly corresponds to how loud the signal is. The energy in a signal is defined as $\sum_{n} |x(n)|^{2}$, and the root-mean-square energy (RMSE) of a frame of length $N$ is $\sqrt{\tfrac{1}{N}\sum_{n} |x(n)|^{2}}$.
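A minimal sketch of computing frame-wise RMSE with librosa follows; the path and frame parameters are illustrative assumptions (older librosa releases name the function rmse rather than rms).

```python
import numpy as np
import librosa

y, sr = librosa.load("genres/rock/rock.00000.wav")  # placeholder path

# Frame-wise root-mean-square energy over 2048-sample frames.
rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)

# The same quantity for one frame, computed directly from the definition
# above (librosa centers and pads frames, so values may differ slightly).
frame = y[:2048]
rms_manual = np.sqrt(np.mean(frame ** 2))
```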
4.2.4 Chromagram features
By identifying pitches that differ by an octave, chroma features show a high degree of robustness to variations in timbre and correlate closely with the musical aspect of harmony. This is why chroma features are a well-established tool for processing and analyzing music data. For example, chord recognition procedures typically rely on some kind of chroma representation. Chroma features have also become the de facto standard for tasks such as music alignment and synchronization, as well as audio structure analysis. Finally, chroma features have turned out to be a powerful mid-level feature representation in content-based audio retrieval tasks such as cover song identification and audio matching.
In the current project we used the chroma variant "Chroma Energy Normalized Statistics" (Chroma CENS).
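A minimal sketch of extracting CENS chroma features with librosa, assuming a placeholder file path and default parameters:

```python
import librosa

y, sr = librosa.load("genres/rock/rock.00000.wav")  # placeholder path

# 12-dimensional CENS chromagram: one value per pitch class per frame.
chroma = librosa.feature.chroma_cens(y=y, sr=sr)
print(chroma.shape)  # (12, number_of_frames)
```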
4.2.5 Spectral Centroid
The spectral centroid is the magnitude-weighted mean frequency of a frame:

$$\mathrm{Centroid} = \frac{\sum_{n} f(n)\, x(n)}{\sum_{n} x(n)}$$

where $x(n)$ represents the weighted frequency value, or magnitude, of bin number $n$, and $f(n)$ represents the center frequency of that bin.
Because the spectral centroid is a good predictor of the "brightness" of a sound, it is widely used in digital audio and music processing as an automatic measure of musical timbre.
Spectral Centroid Plot for Sample rock file from GTZAN dataset
Power Spectrogram and Spectral Contrast of sample rock file from GTZAN dataset
Tonal Centroid plot for sample rock file from GTZAN dataset
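For illustration, the spectral centroid, spectral contrast, and tonal centroid features shown in the plots above can be computed with librosa as sketched below; the file path is a placeholder and all parameters are library defaults rather than the project's exact settings.

```python
import librosa

y, sr = librosa.load("genres/rock/rock.00000.wav")  # placeholder path

# Spectral centroid: magnitude-weighted mean frequency per frame.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

# Spectral contrast: peak-to-valley energy difference per sub-band.
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)

# Tonal centroid (tonnetz), computed on the harmonic component.
tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)
```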
4.3.3 Loudness
The general loudness of the track.
4.3.4 Tempo
Tempo in beats per minute, as supplied by The Echo Nest.
A middle ground is to use dimensionality reduction, which reduces the number of features needed while retaining the efficacy of the available features in distinguishing the classes. One such approach is discriminant analysis; in this project, we have used Linear Discriminant Analysis.
Linear Discriminant Analysis (LDA) [14] is a commonly used dimensionality reduction technique in the pre-processing step for pattern classification and machine learning applications. The goal is to project a dataset onto a lower-dimensional space with good class separability, in order to avoid overfitting (the "curse of dimensionality") and also to reduce computational cost.
We used LDA as the basis of our approach throughout the project, regardless of the final machine learning models developed. For both the GTZAN dataset and the Million Song Dataset, LDA is used to arrive at the factors that best discriminate the classes. These factors are in turn fed to various supervised learning models, and the model accuracies are compared; a minimal sketch follows.
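The sketch below shows the LDA step with scikit-learn; the placeholder data stands in for the extracted audio features, and the choice of nine components follows from the ten-class setting rather than from project-specific tuning.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder features and labels standing in for the extracted audio features.
X = np.random.rand(1000, 30)
y = np.random.randint(0, 10, 1000)

# With 10 genres, LDA yields at most n_classes - 1 = 9 discriminant components.
lda = LinearDiscriminantAnalysis(n_components=9)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # (1000, 9)
```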
Multilayer perceptron models are often applied to supervised learning problems: they train on a set of input-output pairs and learn to model the correlation (or dependencies) between those inputs and outputs. Training involves adjusting the parameters (the weights and biases) of the model in order to minimize error. Backpropagation is used to make those weight and bias adjustments relative to the error, and the error itself can be measured in a variety of ways, including by root mean squared error (RMSE).
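A minimal sketch of such a classifier using scikit-learn's MLPClassifier follows; the hidden-layer sizes, iteration budget, and placeholder data are illustrative assumptions, not the project's exact configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Placeholder LDA-reduced features (9 components, 10 genres); illustrative only.
X = np.random.rand(1000, 9)
y = np.random.randint(0, 10, 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Hidden-layer sizes and iteration budget are assumptions for this sketch.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))  # mean accuracy on the held-out 30%
```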
The basic SVM model performs binary classification. The current project poses a multi-class classification problem, and hence a variation of SVM known as SVC (C-Support Vector Classification) has been used. SVC implements the "one-against-one" approach (Knerr et al., 1990 [18]) for multi-class classification: if n_class is the number of classes, then n_class * (n_class - 1) / 2 classifiers are constructed, each trained on data from two classes. To provide a consistent interface with other classifiers, the decision_function_shape option allows the results of the "one-against-one" classifiers to be aggregated into a decision function of shape (n_samples, n_classes).
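For illustration, a minimal SVC sketch with scikit-learn follows; the placeholder data is an assumption, while the RBF kernel matches the kernel discussed later in this report.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder LDA-reduced features and 10 genre labels (illustrative only).
X = np.random.rand(300, 9)
y = np.random.randint(0, 10, 300)

# One-against-one multi-class SVC: with 10 classes, 10 * 9 / 2 = 45 binary
# classifiers are trained internally; "ovr" aggregates their votes into a
# decision function of shape (n_samples, n_classes).
clf = SVC(kernel="rbf", decision_function_shape="ovr")
clf.fit(X, y)
print(clf.decision_function(X).shape)  # (300, 10)
```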
Naïve Bayes model output on GTZAN and Million Song Data Set
Random Forest model output on GTZAN and Million Song Data Set
Our experiments showed that, for both datasets, LDA followed by multi-class SVM (SVC: C-Support Vector Classification) gave the highest test accuracy. While Gaussian Naïve Bayes did well on the GTZAN dataset, it fell well short in accuracy on the Million Song Dataset.
We compared our experimental results against the available research on GTZAN genre classification and found that our accuracies are much higher than those of published state-of-the-art models [20]. The table below shows the accuracy results of state-of-the-art genre classification models on the GTZAN dataset:
| Title | Authors | Institution | Features | No. of Genres | Best Model & Accuracy |
|---|---|---|---|---|---|
| Music Genre Classification [21] | Archit Rathore (12152), Margaux Dorido (EXY1420) | IIT Kanpur | MFCC, Spectral Centroid, Zero Crossing Rate, Chroma Frequencies, Spectral Roll-Off | 10 | Poly Kernel SVM, 78% |
| Music Genre Classification and Variance Comparison on Number of Genres [22] | Miguel Francisco, Dong Myung Kim | Stanford University | MFCC, Chroma | 10 | Multi-class SVM, 35% |
| Music Genre Classification: A Multilinear Approach [23] | Ioannis Panagakis, Emmanouil Benetos, and Constantine Kotropoulos | Aristotle University of Thessaloniki | Features extracted from the cortical representation of sound using multilinear subspace analysis techniques | 10 | Non-Negative Tensor Factorization (NTF), 78.20% |
| Music Genre Classification using Machine Learning | Sudheer Peddineni, Sameer Kumar Vittala | Indian School of Business | MFCC, Mel Spectrogram, Chroma CENS, Spectral Centroid, Spectral Contrast, Tonal Features | 10 | LDA with Multi-class SVM, 94.33% |
SVM-SVC with a radial basis function kernel is extensively used for various classification problems, such as pattern classification and gene classification, and has been found to be robust and scalable [19].
Since classification tasks require high accuracy, we recommend LDA followed by SVM-SVC as the model to use for audio genre classification with features extracted from raw audio files.
It is also worth noting that the KNN, Gaussian Naïve Bayes, and Random Forest models also gave test accuracies above 90% for the GTZAN dataset.
Our work can be further extended to develop an application in which a user interface would allow the user to provide the URL of any song or upload a song from a local device. The application would then extract the features and predict the genre of the clip.
Most modern-day music is a fusion of multiple genres, such as blues + classical or indie pop + metal + jazz. Both the datasets used have pure genre labels rather than fusion genres.
The GTZAN dataset and its genres are applicable only to Western music; many other styles, such as Indian, Asian, and Middle Eastern music, are not in the scope of the current project.
The segment timbre definition states that it is MFCC-like (MFCC plus PCA), but it is not equivalent to the MFCC of the librosa package that was used to extract the low-level features of the GTZAN data set.
Some features, such as danceability, loudness, and energy, have many zeros and missing values, making them unusable.
Lack of access to the raw audio files of Million Song Dataset tracks limited our ability to extract features as we could for the GTZAN dataset. This is one of the primary reasons the accuracy of models on the MSD is below 65%, as opposed to 90% and above for the GTZAN dataset.
The total number of tracks with genre labels is 191,000, and extracting features from the HDF5 files is computationally intensive, which eventually exhausted our CPU resources.
All the previous efforts used existing features as-is rather than transformed features, which can not only reduce the number of dimensions but also effectively discriminate the classes. LDA has proved to be a very effective tool for reducing the number of dimensions and identifying the components that best discriminate the classes. We strongly recommend the use of dimensionality reduction methods such as LDA and QDA in future classification efforts.
Having a sample of 100 audio files per decade per genre would improve the predictive accuracy of the models; efforts should be made to accumulate such a collection.
Most modern-day music is a fusion of multiple genres, such as blues + classical or indie pop + metal + jazz. Adding fusion genres and related audio files would make the model more effective from a commercialization perspective.
The GTZAN dataset and its genres apply only to Western music; there are many other styles, such as Indian, Asian, and Middle Eastern music. Preparing a dataset with sufficient samples encompassing all these styles is a formidable task, but if accomplished it would considerably strengthen the model.
Access to the raw audio files for all the tracks of the MSD would be very useful for developing more accurate and effective models based on deep learning.