
Million Song Dataset

Feature extraction with Spectral Analysis


Classification with k-Means algorithm
Data Set used (1)

MillionSongSubset from https://labrosa.ee.columbia.edu: 10,000 songs (1% of the Million Song Dataset) selected at random.
The data are in HDF5 format, a file format designed for storing large data arrays.
I used a MATLAB wrapper to access the data in the HDF5 files. This wrapper was also found at https://labrosa.ee.columbia.edu.
Data Set used (2)

• The data for each song is stored in a .h5 file. It looks like the pictures below:
• There is no audio signal data, only metadata such as year, artist…
Input set

1000 arrays, like the one in the picture, containing the ASCII codes of the song names
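A minimal sketch of how a song name could be turned into such an array, written in Python rather than the MATLAB used for this work. The function name `title_to_codes`, the fixed array length, and the zero padding are assumptions made for illustration; the slides do not say how names of different lengths were handled.

```python
def title_to_codes(title, length=16):
    """Return the ASCII codes of `title`, truncated or zero-padded to `length`."""
    # Characters outside ASCII are replaced by '?' (an assumption of this sketch).
    codes = list(title.encode("ascii", errors="replace"))[:length]
    # Pad short names with zeros so every array has the same length.
    return codes + [0] * (length - len(codes))


codes = title_to_codes("Hey Jude", length=10)
```

Here `codes` holds one integer per character of the name, followed by padding zeros.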
Feature extraction using Spectral Analysis

• Feature extraction means projecting the input features from an M-dimensional space onto an N-dimensional space (N < M). The new features in the N-dimensional space should be uncorrelated.
• Spectral analysis can be done with the FFT, which is already implemented in MATLAB as the function fft().
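To make the transform concrete, here is a minimal sketch of the discrete Fourier transform in Python, using only the standard library. It is a naive O(n^2) version written for readability; MATLAB's fft() computes the same transform with a fast algorithm. The helper name `dft` and the sample signal are illustrative assumptions.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a sequence of numbers."""
    n = len(x)
    return [
        # Bin k correlates the signal with a complex sinusoid of frequency k/n.
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


# One period of a cosine sampled at four points.
spectrum = dft([1.0, 0.0, -1.0, 0.0])
magnitudes = [abs(c) for c in spectrum]
```

For this signal the energy falls in bins 1 and 3, while bin 0 (the sum of the samples) is zero.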
Apply fft to input data

• We observe that only the first element has a significant value.
• We therefore keep only the first element of the FFT output for each row of the input data.
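The first FFT element is the sum of the input values, which helps explain why it dominates when every input value is a positive ASCII code. The sketch below illustrates this in Python with a naive transform; the helper `dft` and the example name are assumptions for illustration, not the code used in this work.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a sequence of numbers."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


codes = [72, 101, 121, 32, 74, 117, 100, 101]  # ASCII codes of "Hey Jude"
magnitudes = [abs(c) for c in dft(codes)]

# Keep only the first element as the single feature for this song.
feature = magnitudes[0]
```

In this example `feature` equals the sum of the character codes, and every other magnitude is smaller.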
Classification using K-means algorithm

• Classification with the K-means algorithm means grouping the input features into K clusters using an iterative method.
• The steps of the K-means algorithm are:
• Place K centroids at random in the input feature space.
• Compute the distance from each feature to every centroid and assign the feature to the closest one.
• Recompute each centroid from the features in its cluster.
• Repeat until convergence (no feature changes cluster any more).
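The steps above can be sketched as follows for one-dimensional features, such as the single value kept per song. This is a minimal Python illustration, not the MATLAB code used in this work; the function name `kmeans_1d`, the seed, the iteration cap, and the sample values are assumptions.

```python
import random


def kmeans_1d(values, k, seed=0, max_iter=100):
    """Cluster scalar values into k groups; return (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(values, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each value to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: abs(v - centroids[j])) for v in values
        ]
        # Step 4: stop when no value changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = sum(members) / len(members)
    return centroids, labels


values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centroids, labels = kmeans_1d(values, k=2)
```

With these well-separated values the three small numbers end up in one cluster and the three large numbers in the other.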
K Means Clustering

http://rossfarrelly.blogspot.ro/2012/12/k-meansclustering.html
Weakness of K-means Algorithm

• It is not robust to outliers. Data points very far from the centroid pull it away from its true position.
• The resulting clusters tend to be circular in shape because the assignment is based on distance.
• It is sensitive to initial conditions. Different initial conditions may produce different clusters, and the algorithm may become trapped in a local optimum.
• When there are few data points, the initial grouping strongly determines the final clusters.
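The outlier weakness follows from the centroid being the mean of its cluster. A small Python sketch with made-up numbers (the values and the helper name `centroid` are assumptions for illustration):

```python
def centroid(values):
    """Centroid of a one-dimensional cluster: the arithmetic mean."""
    return sum(values) / len(values)


cluster = [9.0, 10.0, 11.0]
with_outlier = cluster + [50.0]  # one point far from the rest

clean_centroid = centroid(cluster)
shifted_centroid = centroid(with_outlier)
```

A single distant point moves the centroid from 10.0 to 20.0, outside the range of the original three points.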

http://people.revoledu.com/kardi/tutorial/kMean/Weakness.htm
Thank you!
