
ISB – CBA BATCH 8 – CAPSTONE PROJECT REPORT

Music Genre Classification using Machine


Learning Techniques
Sudheer Peddineni (71710078) & Sameer Kumar Vittala (71710099)

Sponsor: Joy Mustafi, MUST Research Club | ISB Mentor: Peeyush Taori

Capstone Project – Music Genre Classification
1 Abbreviations
2 Motivation
3 Project Description
3.1 Music Genre Classification Problem
3.2 Data Set
3.2.1 GTZAN Dataset
3.2.2 Limitations of GTZAN dataset
3.2.3 Million Song Dataset
3.2.4 Limitations of Million Song Dataset
4 Classification Methodology
4.1 Data Setup
4.2 Features extracted from GTZAN data set
4.2.1 Mel-Frequency Cepstral Coefficients (MFCC)
4.2.2 Mel Spectrogram
4.2.3 RMSE
4.2.4 Chromagram features
4.2.5 Spectral Centroid
4.2.6 Spectral Contrast
4.2.7 Tonal Centroid
4.3 Features Selected from Million Song Dataset
4.3.1 Segment Pitches
4.3.2 Segment Timbre
4.3.3 Loudness
4.3.4 Tempo
4.4 Dimensionality Reduction
4.4.1 Linear Discriminant Analysis
5 Machine Learning Methods Tested
5.1 Classification using Multi-Layer Perceptron
5.2 Classification using Support Vector Machines (SVM)
5.3 Classification using Gaussian Naïve Bayes
5.4 Classification using Random Forest Classifier
5.5 K-Nearest Neighbor (KNN)
6 Model Comparison
7 Challenges
8 Conclusion/Recommendations
9 References

1 Abbreviations

List of acronyms used in this project report

Acronym    Expanded Form

AdaBFFs    AdaBoost with decision trees and bags of frames of features
BGD        beaTunes Genre Dataset
GNB        Gaussian Naïve Bayes
ISMIR      The International Society for Music Information Retrieval
KNN        K-Nearest Neighbor
LDA        Linear Discriminant Analysis
LFMGD      Last.fm Genre Dataset
MAPsCAT    Maximum a posteriori classification of scattering coefficients
MFCC       Mel-Frequency Cepstral Coefficients
MIR        Music Information Retrieval
MLN        Multi-Layer Neural Network
MLP        Multi-Layer Perceptron
MSD        Million Song Dataset
PCA        Principal Component Analysis
SRCAM      Sparse representation classification with auditory temporal modulations

2 Motivation
Over the past decade, large collections of music have become increasingly available on various application platforms. Tasks such as music discovery, navigation, and organization have therefore become progressively harder for humans without the help of automated systems. Extensive research effort has been invested in music information retrieval (MIR), at the intersection of signal processing, music modeling, and machine learning.

Music information retrieval has assumed great significance in the recent past owing to its wide business applications. These include recommender systems, track separation and instrument recognition, automatic music transcription, automatic categorization / genre classification, and music generation. [1]

Technology dynamics driving music information retrieval

Consumption of music online via streaming has gained popularity as downloading and storing music files has become easier; large collections of albums are available on the cloud, either as a free or a paid service. A key element of music data management is identifying the genre a particular audio file belongs to, so that large quantities of files can be stored grouped by genre for easier management. Online radio stations now play songs to a particular user based on genre preference. Many online music streaming services recommend a specific song or audio clip to a given user based on their browsing or search history, and even build "smart playlists" from the music the user has played or marked as a preference. With such diverse applications and the large volume of music data in use, music database management is inevitable, and it is becoming a big data problem.

Music genre classification is an ambiguous and subjective task. It is also a contested area of research, either for low classification accuracy or because, some argue, one cannot
classify genres that do not even have clear definitions. End users are nonetheless already accustomed to browsing both physical and online music collections by genre, and this approach has proven reasonably effective. In particular, a recent survey [2] found that end users are more likely to browse and search by genre than by recommendation, artist similarity, or music similarity.

We reviewed previously published efforts on music genre classification and found that reported model accuracies did not exceed approximately 84% [3] when models were trained on sets of low-level features. In this project, we present a novel approach to music genre classification using low-level audio features extracted directly from the raw audio file, which can improve model accuracy beyond 85%. We identified a set of low-level features that have proven effective in separating genres. To these features we applied linear discriminant analysis (LDA) to identify the most important factors that can effectively discriminate the classes. These factors were in turn used as input to various supervised machine learning models, and a comparative study was done to identify the best model. Our experiments confirmed that a high level of classification accuracy can be achieved through a combination of feature selection, dimensionality reduction, and supervised machine learning techniques.

3 Project Description
Today, in the 21st century, the usage of audio music files has grown enormously. With large amounts of audio files comes the need to classify and organize them without human intervention. Automatic music genre recognition (MGR) is a subfield of music information retrieval (MIR) [1]. Algorithms classify files using features extracted from their sound waves. This project is aimed at developing such a solution for genre classification of audio files.

3.1 Music Genre Classification Problem


The question of which genre a music file belongs to is a question of classification, a semantic problem. Music can be classified by its time of creation, geographical origin, topic, or a set of rules related to the sound. Some of these facts are often added to the files as metadata because they cannot be retrieved from the sound waves. Humans, however, classify music by the perception of the sound produced by the audio signal. Music genre is subjective from person to person and can be ambiguous. On top of that, a music file can be assigned more than one genre, combining more than one classification category; e.g., "British music" and "rock music" can come together as "Brit-rock". It is therefore questionable to speak in terms of "accuracy", "hit", or "miss" if a song cannot be objectively assigned a genre. Every accuracy value can therefore only be regarded as a fuzzy approximation [4].

Music genre classification is achieved by learning the characteristics of collections of songs whose genres are already determined. This method is termed supervised machine learning. Another approach is unsupervised learning, in which unlabeled songs are analyzed: by examining their characteristics, the algorithm attempts to build clusters of songs based on similarities. This project builds machine learning models using supervised learning methods and uses them to predict the genre of a given musical clip.

3.2 Data Set


We used two different datasets to train models for music genre classification, one of the most well-studied problems in MIR.

3.2.1 GTZAN Dataset


The GTZAN dataset is composed of 1,000 30-second clips in .au format covering ten genres; it is balanced, with 100 clips per genre. It primarily comprises western musical genres: Blues, Classical, Country, Disco, Hip Hop, Jazz, Metal, Pop, Reggae, and Rock. [5]

This dataset was used in the well-known genre classification paper "Musical genre classification of audio signals" by G. Tzanetakis et al. [6]

The GTZAN dataset does not come with extracted features, so features were extracted from the raw audio files as explained in the next section.

3.2.2 Limitations of GTZAN dataset
The audio files of the GTZAN dataset belong to a specific decade, and genres have evolved since then; fusions of two or more genres, such as classic blues or blues-metal-jazz, are now in vogue. The number of files per genre is too small to conclusively generalize model results to current-age music files. The wave plots of a sample rock music file from GTZAN and a contemporary music file displayed below illustrate the difference.

Wave plot of a rock music clip from GTZAN

Wave plot of a rock music clip from contemporary collection

3.2.3 Million Song Dataset


The Million Song Dataset (MSD) [7] contains 1,000,000 songs from 44,745 unique artists. The MSD does not distribute raw acoustic signals (for copyright reasons), but does distribute a range of extracted audio features, many of which can be used for classification.

The MSD does not supply an associated genre tag. Genre labels for 191,000 MSD tracks are provided by Tagtraum Industries. [8] The tagtraum genre annotations are based on multiple source datasets and allow for ambiguity; details can be found in [9]. Three ground truths were generated, based on the Last.fm dataset, the Top-MAGD dataset, and the beaTunes Genre Dataset (BGD). (beaTunes is an advanced music application for Windows and OS X that lets you analyze, inspect, and play songs, and create compelling playlists.)

For the current project, we used the third ground truth, which is based on a modified BGD and the Last.fm Genre Dataset (LFMGD). This version adds the labels Metal and Punk, maps International to World, removes Vocal, and drops all ambiguous labels.

3.2.4 Limitations of Million Song Dataset


As mentioned before, for the MSD we have access only to audio features pre-extracted by EchoNest. For some features, such as danceability and energy, zeros and missing values constituted about 90% of the values, so those features could not be used. Lack of access to raw audio files limited our ability to extract chroma and spectral features directly from the source. It was observed during experiments that the low-level audio features extracted from the GTZAN audio files were very effective in discriminating genre.

4 Classification Methodology
The datasets we used for this project have pre-defined classes for each audio track, so we used several supervised machine learning techniques to classify genre. The overall classification methodology is summarized in the picture below.

4.1 Data Setup


As discussed in the previous section, we used two different datasets to evaluate performance on music genre classification, one of the most well-studied problems in MIR:

• GTZAN dataset: 1,000 audio clips from ten different genres.
• Million Song dataset: although the dataset covers a million tracks, genres are defined for only about 190,000 of them. We used these 190,000 tracks as the input data set for classification.

The classification models were built separately on each dataset, and their accuracies are discussed separately in the coming sections of this report. The primary reason is that for the GTZAN dataset we could extract several low-level audio features for processing and classification, whereas for the Million Song dataset the source audio clips were not available and we had to depend on the features supplied by EchoNest.

Each dataset was split into training and test subsets of 70% and 30%, respectively, as sketched below.
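A minimal sketch of this split with scikit-learn, under the assumption that the extracted features are already assembled into arrays; the array sizes and the random seed are illustrative stand-ins, not values from our pipeline:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Stand-in feature matrix and labels (1,000 clips x 340 features, 10 genres),
    # mirroring the GTZAN numbers reported later; replace with real features.
    X = np.random.rand(1000, 340)
    y = np.repeat(np.arange(10), 100)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42, stratify=y)
    print(X_train.shape, X_test.shape)  # (700, 340) (300, 340)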

The following sections explain the various features used to build the classification models.

4.2 Features extracted from GTZAN data set

4.2.1 Mel-Frequency Cepstral Coefficients (MFCC)


Human perception of the frequency content of sounds does not follow a linear scale but a logarithmic distribution. Mel-frequency cepstral coefficients (MFCCs) are based on the spectral information of a sound, but are modelled to capture the perceptually relevant parts of the auditory spectrum. The sequence of processing is as follows:

• Window the data (e.g. with a Hamming window);


• Calculate the magnitude of the FFT;
• Convert the FFT data into filter bank outputs;
• Calculate the log base 10;
• Calculate the cosine transform.

The filter bank is what makes MFCCs unique. It is constructed using 13 linearly spaced filters and 27 log-spaced filters, following a common model for human auditory perception. The distance between the center frequencies of the linearly spaced filters is 133.33 Hz; the log-spaced filters are separated by a factor of 1.071 in frequency. The final cosine transform (the last step above) is applied to reduce the dimensionality of the output, typically to the 12 most important coefficients. Additionally, the power of the signal for each frame is calculated, resulting in a feature vector of d = 13.
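As a concrete illustration, the librosa call below runs this pipeline end to end (librosa's default filter bank differs slightly from the 13-linear/27-log design described above, so treat this as an approximation; the file path is hypothetical):

    import numpy as np
    import librosa

    # Load a 30-second GTZAN clip (hypothetical path) at its native sample rate.
    y, sr = librosa.load("genres/rock/rock.00000.au", sr=None)

    # librosa.feature.mfcc performs the windowing, FFT, Mel filter bank,
    # log, and cosine-transform steps internally; keep 13 coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    print(mfcc.shape)  # (13, number_of_frames)

    # One fixed-length descriptor per clip: per-coefficient mean and std.
    mfcc_vector = np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1)])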

MFCCs are commonly used in speech recognition systems and seem to capture the perceptually relevant part of the spectrum better than other techniques. They have been applied successfully to content-based retrieval of audio samples and are also used in music genre recognition systems.

The MFCC plot is harder to interpret visually than the spectrogram, but has been found to yield better results in computer sound analysis.

Sample plot of MFCC for a rock music file from GTZAN dataset

4.2.2 Mel Spectrogram


The Mel spectrogram is an acoustic time-frequency representation of a sound: the power spectral density P(f, t). It is sampled at a number of points around equally spaced times t_i and frequencies f_j (on a Mel frequency scale). The Mel frequency scale is defined as:

mel = 2595 * log10(1 + hertz / 700)

and its inverse is:

hertz = 700 * (10^(mel / 2595) - 1)
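These two formulas translate directly into code; a small sketch to sanity-check them (1000 Hz maps to roughly 1000 mel by construction of the scale):

    import math

    def hz_to_mel(hz: float) -> float:
        return 2595.0 * math.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel: float) -> float:
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    print(hz_to_mel(1000.0))            # ~1000 mel
    print(mel_to_hz(hz_to_mel(440.0)))  # round-trips back to 440.0 Hz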

4.2.3 RMSE
The energy [10] of a signal corresponds to the total magnitude of the signal; for audio signals, that roughly corresponds to how loud the signal is. For a signal x(n) with N samples, the energy is defined as

E = sum_n |x(n)|^2

and the root-mean-square energy (RMSE) is defined as

RMSE = sqrt( (1/N) * sum_n |x(n)|^2 )
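A direct NumPy rendering of this definition (librosa.feature.rms computes the same quantity frame by frame; the test tone here is purely illustrative):

    import numpy as np

    def rms_energy(x: np.ndarray) -> float:
        # Root-mean-square energy, exactly as defined above.
        return float(np.sqrt(np.mean(np.abs(x) ** 2)))

    sr = 22050
    x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of a 440 Hz tone
    print(rms_energy(x))  # ~0.707 for a unit-amplitude sine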

4.2.4 Chromagram features


In the music context, the term chroma feature or chromagram closely relates to the twelve different pitch classes. [11] Chroma-based features, also referred to as pitch class profiles, are a powerful tool for analyzing music whose pitches can be meaningfully categorized (often into twelve categories) and whose tuning approximates the equal-tempered scale. One main property of chroma features is that they capture harmonic and melodic characteristics of music while being robust to changes in timbre and instrumentation.

By identifying pitches that differ by an octave, chroma features show a high degree of robustness to variations in timbre and correlate closely with the musical aspect of harmony. This is why chroma features are a well-established tool for processing and analyzing music data: for example, every chord recognition procedure relies on some kind of chroma representation. Chroma features have also become the de facto standard for tasks such as music alignment and synchronization as well as audio structure analysis. Finally, they have turned out to be a powerful mid-level feature representation in content-based audio retrieval, such as cover song identification or audio matching.

In the current project we used the chroma variant Chroma Energy Normalized Statistics (Chroma CENS).

Sample plot of the Chroma CENS feature for a rock file from GTZAN

4.2.5 Spectral Centroid


The spectral centroid is a measure used in digital signal processing to characterize a spectrum. It indicates where the "center of mass" of the spectrum is located. Perceptually, it has a robust connection with the impression of the "brightness" of a sound. It is calculated as the weighted mean of the frequencies present in the signal, determined using a Fourier transform, with their magnitudes as the weights:

Centroid = ( sum_n f(n) * x(n) ) / ( sum_n x(n) )

where x(n) represents the weighted frequency value, or magnitude, of bin number n, and f(n) represents the center frequency of that bin.

Because the spectral centroid is a good predictor of the "brightness" of a sound, it is widely used in digital audio and music processing as an automatic measure of musical timbre.
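The formula can be implemented in a few lines; the sketch below computes the centroid over a single frame and checks it against a pure tone, whose centroid should sit at its own frequency:

    import numpy as np

    def spectral_centroid(x: np.ndarray, sr: int) -> float:
        # Weighted mean of the FFT bin frequencies, magnitudes as weights,
        # matching the formula above for a single frame.
        mag = np.abs(np.fft.rfft(x))
        freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
        return float(np.sum(freqs * mag) / np.sum(mag))

    sr = 22050
    x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
    print(spectral_centroid(x, sr))  # ~440 Hz for a pure 440 Hz tone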

Spectral Centroid Plot for Sample rock file from GTZAN dataset

4.2.6 Spectral Contrast
Octave-based spectral contrast [12], introduced by Jiang et al., considers the spectral peak, the spectral valley, and their difference in each sub-band. For most music, strong spectral peaks roughly correspond to harmonic components, while non-harmonic components, or noise, often appear at spectral valleys. The spectral contrast feature thus roughly reflects the relative distribution of harmonic and non-harmonic components in the spectrum.

Power Spectrogram and Spectral Contrast of sample rock file from GTZAN dataset

4.2.7 Tonal Centroid


The tonal centroid, introduced by Harte et al. [13], maps a chromagram onto a six-dimensional hypertorus. The resulting representation wraps around the surface of the hypertorus and can be visualized as a set of three circles of harmonic pitch intervals: fifths, major thirds, and minor thirds. Tonal centroids are efficient at detecting changes in harmonic content.

Tonal Centroid plot for sample rock file from GTZAN dataset
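Tying section 4.2 together, the sketch below shows one plausible librosa recipe that extracts the features above and summarizes each by its per-dimension mean and standard deviation. The exact recipe behind the 340-feature count quoted in section 4.4 is not spelled out in this report, so the function name and the resulting dimensionality are illustrative:

    import numpy as np
    import librosa

    def extract_gtzan_features(path: str) -> np.ndarray:
        # Illustrative fixed-length descriptor built from the section 4.2 features.
        y, sr = librosa.load(path, sr=None)
        frame_features = [
            librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
            librosa.feature.melspectrogram(y=y, sr=sr),
            librosa.feature.rms(y=y),  # "rmse" in older librosa releases
            librosa.feature.chroma_cens(y=y, sr=sr),
            librosa.feature.spectral_centroid(y=y, sr=sr),
            librosa.feature.spectral_contrast(y=y, sr=sr),
            librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr),
        ]
        # Collapse each frame-level feature to its per-row mean and std.
        return np.hstack([np.hstack([f.mean(axis=1), f.std(axis=1)])
                          for f in frame_features])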

4.3 Features Selected from Million Song Dataset
The Million Song dataset, as mentioned earlier, is not a collection of audio files but a collection of features extracted by EchoNest. We had to choose the best available low-level features, those closest to the GTZAN features.

4.3.1 Segment Pitches


This is a chroma feature with one value per note. The data type is a 2D array; for each column, the mean and standard deviation were taken.

4.3.2 Segment Timbre


These are texture features, akin to MFCC+PCA. The data type is a 2D array; for each column, the mean and standard deviation were taken.

4.3.3 Loudness
The general loudness of the track.

4.3.4 Tempo
Tempo in beats per minute, as computed by EchoNest.
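A sketch of how these four features can be flattened into one vector per track. It assumes the hdf5_getters helper module distributed with the MSD; the mean/std summarization follows the descriptions above, and the resulting 12*2 + 12*2 + 2 = 50 values match the feature count reported in the next section:

    import numpy as np
    import hdf5_getters  # helper module distributed with the Million Song Dataset

    def msd_track_features(path: str) -> np.ndarray:
        h5 = hdf5_getters.open_h5_file_read(path)
        try:
            pitches = hdf5_getters.get_segments_pitches(h5)  # (n_segments, 12)
            timbre = hdf5_getters.get_segments_timbre(h5)    # (n_segments, 12)
            loudness = hdf5_getters.get_loudness(h5)
            tempo = hdf5_getters.get_tempo(h5)
        finally:
            h5.close()
        # Column-wise mean and std of the 2D arrays, plus the two scalars.
        return np.hstack([pitches.mean(axis=0), pitches.std(axis=0),
                          timbre.mean(axis=0), timbre.std(axis=0),
                          [loudness, tempo]])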

4.4 Dimensionality Reduction


The total number of features extracted from GTZAN was 340, and from the Million Song dataset, 50. It therefore became imperative to explore dimensionality reduction (to avoid the "curse of dimensionality") and identify the most important features, those that can classify the genres effectively.

4.4.1 Linear Discriminant Analysis


There are various approaches to using features for classification problems. One approach is feature engineering: combining two or more features into new features that explain the class effectively while eliminating the possibility of correlation.

The other approach is to feed all features to a deep learning network and allow it to figure out the weights. This approach can be computationally intensive, requiring more resources, and hence costly if scalability is a key criterion.

A middle ground is dimensionality reduction, which reduces the number of features needed while retaining their efficacy in distinguishing the classes. One such approach is discriminant analysis; in this project, we used Linear Discriminant Analysis.

Linear Discriminant Analysis (LDA) [14] is a commonly used dimensionality reduction technique in the pre-processing step for pattern classification and machine learning applications. The goal is to project a dataset onto a lower-dimensional space with good class separability, in order to avoid overfitting ("curse of dimensionality") and reduce computational costs.

We used LDA as the basis of our approach throughout the project, irrespective of the final machine learning model. For both the GTZAN and Million Song datasets, LDA was used to arrive at the factors that best discriminate the classes. These factors were in turn fed to various supervised learning models, and the model accuracies were compared, as sketched below.
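A minimal sketch of this step with scikit-learn, continuing from the hypothetical X_train / X_test arrays of section 4.1. LDA projects onto at most (number of classes - 1) axes, which is exactly where the 9- and 14-component counts in the next section come from:

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # 9 components for the 10 GTZAN genres (14 would be used for 15 MSD genres).
    lda = LinearDiscriminantAnalysis(n_components=9)
    X_train_lda = lda.fit_transform(X_train, y_train)  # fit on training data only
    X_test_lda = lda.transform(X_test)                 # reuse the same projection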

5 Machine Learning Methods Tested
We used several machine learning techniques to classify the genres of the two datasets discussed previously in this report. A brief overview of each underlying algorithm and its results is presented below.

5.1 Classification using Multi-Layer Perceptron


A multilayer perceptron (MLP) is a feed-forward artificial neural network composed of more than one perceptron. (A perceptron is a linear classifier: an algorithm that classifies input by separating two categories with a straight line. The input is typically a feature vector x multiplied by weights w and added to a bias b: y = w * x + b. A perceptron produces a single output from several real-valued inputs by forming a linear combination using its input weights, sometimes passing the output through a nonlinear activation function.) MLPs are composed of an input layer that receives the signal, an output layer that makes a decision or prediction about the input, and, in between those two, an arbitrary number of hidden layers that are the true computational engine of the MLP. MLPs with one hidden layer are capable of approximating any continuous function.

Multilayer perceptrons are often applied to supervised learning problems: they train on a set of input-output pairs and learn to model the correlation (or dependencies) between those inputs and outputs. Training involves adjusting the parameters of the model, its weights and biases, in order to minimize error. Backpropagation is used to make those weight and bias adjustments relative to the error, which itself can be measured in a variety of ways, including root mean squared error (RMSE).
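A sketch of such a classifier with scikit-learn, applied to the LDA-reduced features from section 4.4; the hidden-layer size and iteration budget are illustrative assumptions, not the exact configuration tuned for this project:

    from sklearn.neural_network import MLPClassifier

    # One hidden layer; scikit-learn fits the weights and biases via
    # backpropagation under a gradient-based solver.
    mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=42)
    mlp.fit(X_train_lda, y_train)
    print("MLP test accuracy:", mlp.score(X_test_lda, y_test))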

MLP model output on the GTZAN and Million Song datasets

                                          GTZAN     Million Song Dataset
No. of genres                             10        15
No. of features extracted                 340       50
No. of features after dim. reduction      9         14
Test accuracy                             93.66%    59.66%

5.2 Classification using Support Vector Machines (SVM)
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane that categorizes new examples. In two-dimensional space this hyperplane is a line dividing the plane into two parts, with one class on each side.

The basic SVM models are used for binary classification. The current project poses a multi-class classification problem, so the SVC (C-Support Vector Classification) variant of SVM has been used. SVC implements the "one-against-one" approach (Knerr et al., 1990 [18]) for multi-class classification: if n_class is the number of classes, then n_class * (n_class - 1) / 2 classifiers are constructed, each trained on data from two classes. To provide a consistent interface with other classifiers, the decision_function_shape option makes it possible to aggregate the results of the "one-against-one" classifiers into a decision function of shape (n_samples, n_classes), as sketched below.
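A minimal sketch of the classifier described above (the RBF kernel follows the discussion in section 6; other hyperparameters are scikit-learn defaults, not our tuned values):

    from sklearn.svm import SVC

    # With 10 genres, SVC trains 10 * 9 / 2 = 45 one-against-one classifiers;
    # decision_function_shape="ovr" aggregates them into a decision function
    # of shape (n_samples, n_classes).
    svc = SVC(kernel="rbf", decision_function_shape="ovr")
    svc.fit(X_train_lda, y_train)
    print("SVM-SVC test accuracy:", svc.score(X_test_lda, y_test))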

SVM model output on the GTZAN and Million Song datasets

                                          GTZAN     Million Song Dataset
No. of genres                             10        15
No. of features extracted                 340       50
No. of features after dim. reduction      9         14
Test accuracy                             95.34%    63.32%

5.3 Classification using Gaussian Naïve Bayes
The Gaussian Naive Bayes algorithm is a special type of Naïve Bayes (NB) algorithm, used specifically when the features take continuous values. It assumes that every feature follows a Gaussian, i.e. normal, distribution.

Naïve Bayes model output on the GTZAN and Million Song datasets

                                          GTZAN     Million Song Dataset
No. of genres                             10        15
No. of features extracted                 340       50
No. of features after dim. reduction      9         14
Test accuracy                             95.33%    57.68%

5.4 Classification using Random Forest Classifier
Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.

Random Forest model output on the GTZAN and Million Song datasets

                                          GTZAN     Million Song Dataset
No. of genres                             10        15
No. of features extracted                 340       50
No. of features after dim. reduction      9         14
Test accuracy                             93%       60.65%

5.5 K-Nearest Neighbor (KNN)
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based
on a similarity measure (e.g., distance functions). A case is classified by a majority vote of its neighbors,
with the case being assigned to the class most common amongst its K nearest neighbors measured by a
distance function. If K = 1, then the case is simply assigned to the class of its nearest neighbor.

KNN model output on the GTZAN and Million Song datasets

                                          GTZAN     Million Song Dataset
No. of genres                             10        15
No. of features extracted                 340       50
No. of features after dim. reduction      9         14
Test accuracy                             94.33%    57.39%

6 Model Comparison

Our experiments showed that, for both datasets, LDA followed by multiclass SVM (SVC: C-Support Vector Classification) gave the highest test accuracy. While Gaussian Naïve Bayes did well on the GTZAN dataset, it fell well short in accuracy on the Million Song Dataset. A sketch of the comparison procedure follows.
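One way to reproduce this comparison, reusing the LDA-reduced split from the earlier sketches; hyperparameters are illustrative defaults rather than the tuned settings behind the accuracies reported in section 5:

    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    models = {
        "MLP": MLPClassifier(max_iter=1000, random_state=42),
        "SVM-SVC": SVC(kernel="rbf"),
        "Gaussian NB": GaussianNB(),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
        "KNN": KNeighborsClassifier(n_neighbors=5),
    }
    for name, model in models.items():
        model.fit(X_train_lda, y_train)
        print(f"{name:14s} test accuracy: {model.score(X_test_lda, y_test):.2%}")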

We compared our experimental results against the published research on GTZAN genre classification and found that our accuracies are considerably higher than those of the published state-of-the-art models [20]. The table below shows the accuracy results of state-of-the-art genre classification models on the GTZAN dataset:

System     System Configuration            Mean Accuracy

AdaBFFs    Decision stumps                 77.60%
AdaBFFs    Two-node trees                  80.00%
SRCAM      Normalized features             85.50%
SRCAM      Standardized features           80.20%
MAPsCAT    Class-dependent covariances     75.40%
MAPsCAT    Total covariance                83.00%

We also compared our experimental results against model accuracies published in various research papers, as shown below:

Title: Music Genre Classification [21]
Authors: Archit Rathore (12152), Margaux Dorido (EXY1420)
Institution: IIT Kanpur
No. of genres: 10
Features: MFCC, spectral centroid, zero crossing rate, chroma frequencies, spectral roll-off
Best model & accuracy: Poly-kernel SVM, 78%

Title: Music Genre Classification and Variance Comparison on Number of Genres [22]
Authors: Miguel Francisco, Dong Myung Kim
Institution: Stanford University
No. of genres: 10
Features: MFCC, chroma
Best model & accuracy: Multi-class SVM, 35%

Title: Music Genre Classification: A Multilinear Approach [23]
Authors: Ioannis Panagakis, Emmanouil Benetos, and Constantine Kotropoulos
Institution: Aristotle University of Thessaloniki
No. of genres: 10
Features: extracted from the cortical representation of sound using multilinear subspace analysis techniques
Best model & accuracy: Non-Negative Tensor Factorization (NTF), 78.20%

Title: Music Genre Classification using Machine Learning (this work)
Authors: Sudheer Peddineni, Sameer Kumar Vittala
Institution: Indian School of Business
No. of genres: 10
Features: MFCC, Mel spectrogram, Chroma CENS, spectral centroid, spectral contrast, tonal features
Best model & accuracy: LDA with multi-class SVM, 94.33%

SVM-SVC with a radial basis function kernel is used extensively for classification problems such as pattern classification and gene classification, and has been found to be robust and scalable [19].

Since classification tasks require high accuracy, we recommend LDA followed by SVM-SVC as the model for audio genre classification using features extracted from raw audio files.

It is also worth noting that the KNN, Gaussian Naïve Bayes, and Random Forest models also gave test accuracies above 90% on the GTZAN dataset.

Our work can be further extended into an application in which a user interface lets the user provide the URL of any song or upload a song from a local device. The
underlying audio would then be extracted and fed to the SVM model, which in turn would return the genre of the song to the user.

7 Challenges

GTZAN Data Set


Only 100 audio files are available per genre, and the files do not capture the variation within each genre from the early 1970s to the present.

Most modern-day music is a fusion of multiple genres (blues + classical, indie pop + metal + jazz, etc.), whereas both datasets used here carry pure genre labels rather than fusion genres.

The GTZAN dataset and its genres apply only to western music; many other styles, such as Indian, Asian, and Middle Eastern music, are outside the scope of the current project.

Million Song Data Set


The MSD provides features extracted by EchoNest; certain features, such as tonal features, Mel spectrogram, spectral contrast, and spectral centroid, are missing entirely, which resulted in lower classification accuracy.

The segment timbre definition states that it is like MFCC+PCA, but it is not equivalent to the MFCCs from the librosa package that were used to extract the low-level features of the GTZAN dataset.

Some features, such as danceability and energy, have many zeros and missing values, making them unusable.

Lack of access to the raw audio files of Million Song tracks limits our ability to extract features as we could for the GTZAN dataset. This is a primary reason why model accuracies on the MSD stayed below 65%, as opposed to 90% and above for GTZAN.

The total number of tracks with genre labels is 191,000, and extracting features from the HDF5 files is computationally intensive, which eventually exhausted our CPU resources.

8 Conclusion/Recommendations
The features identified for the GTZAN dataset have been very effective in discriminating genre.

Previous efforts aimed at using existing features rather than transformed features, which can not only reduce the number of dimensions but also be more effective in discriminating the classes. LDA has proved a very effective tool for reducing the number of dimensions as well as identifying the components that effectively discriminate the classes. We strongly recommend dimensionality reduction methods such as LDA and QDA for future classification efforts.

A sample of 100 audio files per decade per genre would be more effective in improving the predictive accuracy of the models; efforts should be made to accumulate such a collection.

Most modern-day music is a fusion of multiple genres (blues + classical, indie pop + metal + jazz, etc.). Adding fusion genres and related audio files would make the model more effective from a commercialization perspective.

The GTZAN dataset and its genres apply only to western music, and there are many other styles, such as Indian, Asian, and Middle Eastern music. Preparing a dataset with sufficient samples encompassing all these styles is an enormous task, but accomplishing it would add considerable power to the model.

Access to the raw audio files for all MSD tracks would be very useful for developing more accurate and effective models based on deep learning.

9 References

1. Music information retrieval, Wikipedia. Link here
2. Jin Ha Lee and J. Stephen Downie, "Survey of Music Information Needs, Uses, and Seeking Behaviours: Preliminary Findings", ISMIR 2004. Link to article here
3. Bob L. Sturm, "Classification accuracy is not enough", Journal of Intelligent Information Systems. Link to article here
4. Janice Wong, "Visualising music: the problems with genre classification". Link to article here
5. GTZAN Genre Collection. Link here
6. George Tzanetakis, Georg Essl, and Perry Cook, "Automatic Musical Genre Classification of Audio Signals", Department of Computer Science, Princeton University. Link to article here
7. Million Song Dataset. Link here
8. Genre annotations for the Million Song Dataset. Link here
9. Hendrik Schreiber, "Improving Genre Annotations for the Million Song Dataset", Tagtraum Industries Incorporated. Link to article here
10. Energy (signal processing), Wikipedia. Link here
11. Chroma feature, Wikipedia. Link here
12. Dan-Ning Jiang, Lie Lu, Hong-Jiang Zhang, Jian-Hua Tao, and Lian-Hong Cai, "Music type classification by spectral contrast feature". Link to article here
13. Christopher Harte and Mark Sandler (Centre for Digital Music, Queen Mary, University of London, UK) and Martin Gasser (Austrian Research Institute for Artificial Intelligence (OFAI), Vienna, Austria), "Detecting Harmonic Change in Musical Audio". Link to article here
14. Linear discriminant analysis, Wikipedia. Link here
15. 7 Websites for Music Lovers. Link to article here
16. Thomas Lidy and Andreas Rauber, "Evaluation of Feature Extractors and Psycho-Acoustic Transformations for Music Genre Classification", Department of Software Technology and Interactive Systems, Vienna University of Technology. Link to article here
17. Karin Kosina, "Music Genre Recognition". Link to article here
18. S. Knerr, L. Personnaz, and G. Dreyfus, "Single-layer learning revisited: A stepwise procedure for building and training a neural network". Link to article here
19. Shigeo Abe, "Support Vector Machines for Pattern Classification". Link to article here
20. Bob L. Sturm, "Classification accuracy is not enough", Journal of Intelligent Information Systems. Link to article here
21. Archit Rathore and Margaux Dorido, "Music Genre Classification", Indian Institute of Technology, Kanpur. Link to article here
22. Miguel Francisco and Dong Myung Kim, "Music Genre Classification and Variance Comparison on Number of Genres". Link to article here
23. Ioannis Panagakis, Emmanouil Benetos, and Constantine Kotropoulos, "Music Genre Classification: A Multilinear Approach", Department of Informatics, Aristotle University of Thessaloniki. Link to article here
