
Data & Knowledge Engineering 92 (2014) 60–76

Contents lists available at ScienceDirect

Data & Knowledge Engineering


journal homepage: www.elsevier.com/locate/datak


Music genre classification based on local feature selection using a self-adaptive harmony search algorithm
Yin-Fu Huang*, Sheng-Min Lin, Huan-Yu Wu, Yu-Siou Li
Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, 123 University Road, Section 3, Touliu, Yunlin 640,
Taiwan, ROC

Article info

Article history:
Received 1 February 2012
Received in revised form 16 June 2014
Accepted 4 July 2014
Available online 15 July 2014
Keywords:
Classification
Information retrieval
Feature selection
Harmony search algorithm

Abstract
This paper proposes an automatic music genre-classification system based on a local feature-selection strategy using a self-adaptive harmony search (SAHS) algorithm. First, five acoustic characteristics (i.e., intensity, pitch, timbre, tonality, and rhythm) are extracted to generate an original feature set. A feature-selection model using the SAHS algorithm is then employed for each pair of genres, thereby deriving the corresponding local feature set. Finally, each one-against-one support vector machine (SVM) classifier is fed with the corresponding local feature set, and the majority voting method is used to classify each musical recording. Experiments on the GTZAN dataset were conducted, demonstrating that our method is effective. The results show that the local-selection strategies using wrapper and filter approaches ranked first and third in performance among all relevant methods.
© 2014 Elsevier B.V. All rights reserved.

1. Introduction
Recently, the development of digital audio techniques has matured. Through the Internet, increasingly more online music systems, such as 7digital and Amazon MP3, have been developed to provide abundant music services and copyrighted music downloads for consumers. In general, to facilitate music searches, these systems usually categorize music by using several tags such as blues, classical, and country. However, because a vast amount of music is stored in these systems, manually tagging music is time-consuming. Furthermore, most people perceive music differently. Therefore, developing an automatic music genre-classification system is necessary for the tagging process to be more effective and become standardized. This paper proposes a music genre-classification system, because genre tags are employed for most musical content descriptions [2].
In other studies, most music genre-classification systems have extracted audio features to achieve satisfactory performance. In general, audio features are long-term features, obtained either by estimating the overall statistics of short-term features or by directly describing the audio signal over a long term. Therefore, conventional bag-of-frames approaches [17,20,26,35] have been proposed and employed to generate feature sets for classification, such as the Marsyas framework [35], which contains several features regarding timbre texture, rhythmic content, and pitch content; the genre collection in this framework is called the GTZAN dataset and is frequently used to compare the effectiveness of different systems [1,3,18,22,29–31]. Moreover, some researchers have focused on proposing refined features, such as modulation spectral analysis [15,22], to improve performance. In addition, the auditory model proposed by [37] maps a musical recording to a 3D representation of its slow spectral and temporal modulations. In our study, we also adopted a bag-of-frames approach and focused on five

* Corresponding author. Tel.: +886 5 5342601x4314; fax: +886 5 5312170.


E-mail address: huangyf@yuntech.edu.tw (Y.-F. Huang).
URL: http://mdb.csie.yuntech.edu.tw (Y.-F. Huang).

http://dx.doi.org/10.1016/j.datak.2014.07.005
0169-023X/© 2014 Elsevier B.V. All rights reserved.


acoustic characteristics (i.e., intensity, pitch, timbre, tonality, and rhythm) to extract useful audio features and form an original feature set for music genre classification.
To obtain sufficient knowledge for classification, generated feature sets commonly contain abundant information, likely with redundant features. To solve this problem, dimensionality reduction techniques are frequently employed and can be classified into two approaches. The first approach transforms a feature-set matrix from a high-dimensional space to a lower-dimensional space through a linear combination, using techniques such as principal component analysis (PCA) [16], non-negative multi-linear principal component analysis (NMPCA) [31], non-negative matrix factorization (NMF) [21], and non-negative tensor factorization (NTF) [30]. The second approach is called feature selection, which determines an optimal subset of the original feature set by using search algorithms, such as genetic algorithms (GAs) [19], ant colony optimization (ACO) [7], harmony search (HS) [8], adaptive binary harmony search [36], and support vector machine recursive feature elimination (SVM-RFE) [11]. Both approaches can effectively reduce the dimensions of a feature set. In this study, we adopted a feature-selection approach based on the self-adaptive harmony search (SAHS) algorithm [14]; the approach was validated, thus leading to an optimal solution. In our study, we extensively collected useful features and determined which features were more relevant in music genre classification. In general, feature-selection approaches are employed to select a global feature set for all music genres. In contrast, we adopted a local-selection strategy based on each pair of music genres, because it can derive local feature sets that are more relevant than those obtained with a global-selection strategy. Finally, we verified that using the SAHS achieves higher performance compared to other methods.
For prediction, frequently employed classifiers, such as multi-layer perceptrons (MLPs) [27,32], SVMs [1,15,27,31,32], and linear discriminant analysis (LDA) [15,18,22], have been used to determine optimal linear combinations that discriminate the feature vectors of different classes. In this study, the SVM classifier was adopted because, in general, it demonstrates higher performance than other classifiers do [15,31,32] when kernel functions and parameters are appropriately chosen. We matched each local feature set with an SVM classifier and used the majority voting method to classify each musical recording. Experiments on the GTZAN dataset were conducted, and the results demonstrated that our method is effective. The results showed that the local-selection strategies, which involve adopting two approaches, ranked first and third in performance among all relevant methods.
The remainder of the paper is organized as follows. Section 2 presents the system architecture and briefly describes the music classification procedures. Section 3 introduces five acoustic characteristics and presents the extracted features that form an original feature set. Section 4 describes the SAHS algorithm and the correlation measuring method used to derive an optimal feature subset from the original feature set. Section 5 presents the experimental results of differing selection strategies and approaches. Finally, we offer our conclusion in Section 6.

2. System overview
This paper proposes an automatic music genre-classification system that uses suitable features selected by a meta-heuristic optimization algorithm, the SAHS algorithm [14]. The feature-selection model is applied to each one-against-one classifier so that ambiguous genres can be classified precisely.

Fig. 1. System architecture.


2.1. Support vector machine


SVMs, which Boser et al. first investigated in 1992 [4], solve linearly inseparable problems by non-linearly mapping a vector in a low-dimensional space to a higher-dimensional feature space and constructing an optimal hyperplane in that space. Therefore, SVMs are good candidates for data classification. A classification task usually involves training and test data
that consist of data instances. Each instance in the training set contains one target value (i.e., class labels) and several attributes
(i.e., features). The goal of an SVM is to produce a model that can predict the target value of data instances in a test set by employing
only the attributes.
2.2. System description
As presented in Fig. 1, the system architecture consists of two parts: the training and test phases. In the training phase, we first retrieve all audio features from the training set. These audio features (i.e., intensity, pitch, timbre, tonality, and rhythm) are used to describe audio signals. For the original feature set, we use the feature-selection model to remove irrelevant features for each pair of genres to obtain their respective optimal feature sets (i.e., local feature sets). For N genres, N(N-1)/2 local feature sets are generated. The selection model employs the SAHS algorithm to obtain an optimal solution in which the ratio of the inter-correlation to the intra-correlation of the selected features is maximized using a filter approach. After obtaining the local feature sets for all pairs of genres, each one-against-one SVM is trained using the corresponding local feature set and classes. Finally, all trained SVMs are combined to form an SVM ensemble model that is later used in the test phase.
In the test phase, we first retrieve all audio features from the test set and feed each one-against-one SVM with the corresponding local feature set. For the N(N-1)/2 local predictions, we then use the majority voting method to determine the genre to which a test sample belongs.
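The one-against-one voting described above can be sketched as follows. This is a minimal illustration, not our trained system: the pairwise classifiers are hypothetical placeholders standing in for the trained SVMs, and the feature indices stand in for the local feature sets.

```python
from collections import Counter

def predict_genre(x, pairwise_models):
    """Classify one recording with a one-against-one ensemble.

    `pairwise_models` maps each genre pair (a, b) to a tuple
    (clf, feat_idx): a binary classifier and the indices of its
    local feature set. For N genres there are N*(N-1)/2 entries.
    """
    votes = Counter()
    for clf, feat_idx in pairwise_models.values():
        local = [x[i] for i in feat_idx]   # keep only the local features
        votes[clf(local)] += 1             # clf returns one of the two genres
    # majority voting: the genre with the most pairwise wins
    return votes.most_common(1)[0][0]
```

For example, with three genres and three (here constant) pairwise classifiers, the genre winning the most pairwise duels is returned.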
3. Feature extraction
In our study, feature extraction was a significant process because audio features containing sufficient music information had to be collected for the later selection process. In general, audio signals can be measured in two different ways: in the time domain and in the frequency domain. Regarding audio features in the time domain, audio signal samples are directly processed over time to observe amplitude characteristics such as intensity and rhythm. Regarding the frequency domain, each amplitude sample is transformed from the time domain to a corresponding frequency band in a spectrum through the discrete Fourier transform (DFT). Additional detailed acoustic characteristics, such as pitch, timbre, and tonality, are obtained by estimating the spectrum. In music classification, the five acoustic characteristics of intensity, pitch, timbre, tonality, and rhythm have their own distinctions in various music genres. Based on these two domains, we used the five acoustic characteristics to generate an original feature set.
Recently, researchers have developed and provided useful feature extraction frameworks. First, the Marsyas framework developed by Tzanetakis and Cook [35] consists of 30 descriptive features of rhythm, pitch, and timbre for music classification. Second, the MPEG-7 audio descriptors [17] developed by MPEG provide 17 audio descriptors of intensity, pitch, and timbre for retrieving audio information. Third, both the MIRToolbox developed by Lartillot and Toiviainen [20] and the jAudio software developed by McEnnis et al. [26] provide integrated feature frameworks covering the five acoustic characteristics, including the descriptors in Marsyas and MPEG-7. Thus, we collected the audio descriptors provided in MIRToolbox, jAudio, and MPEG-7 [5,24,25] to form our original feature set, which includes 265 features from 32 audio descriptors. Most descriptors are measured in a frame unit, or analysis window, of 23 ms. However, some descriptors may consist of multiple dimensions depending on the spectral frequency bands. The term "texture window" is used in this paper to describe a larger window, which ideally corresponds to the minimum amount of time and sound required to identify a particular sound or music texture [35]. Rather than using the feature values directly, the parameters of a running multidimensional Gaussian distribution are estimated. Specifically, we calculated the mean and standard deviation of all frames in a texture window of 30 s to form two statistical features. Furthermore, for most descriptors, we also calculated the mean and standard deviation of all differences between adjacent frames to form two new features. The following section explains each acoustic characteristic and its representative features.
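The four long-term statistics computed per descriptor over a texture window can be sketched as follows. This is a minimal illustration of the aggregation step described above; the actual toolboxes compute these statistics internally.

```python
import numpy as np

def texture_window_stats(frames):
    """Summarize per-frame descriptor values over one texture window.

    `frames` is a 1-D array of a descriptor measured on every ~23 ms
    analysis frame. Returns the four long-term statistics used here:
    the mean and standard deviation of the frame values, plus the
    mean and standard deviation of the differences between adjacent
    frames.
    """
    frames = np.asarray(frames, dtype=float)
    diffs = np.diff(frames)  # adjacent-frame differences
    return (frames.mean(), frames.std(),
            diffs.mean(), diffs.std())
```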
3.1. Intensity
In acoustics, intensity is also referred to as loudness, volume, or energy. Intensity is the most obvious characteristic and indicates how strongly sounds are perceived by human ears; it is commonly measured according to the amplitude within a frame in the time domain. Higher amplitude indicates higher intensity. In general, intensity is expressed in decibels (dB). In music, more intense genres, such as heavy metal, commonly have higher intensity. We considered eight intensity features (Table 1).
Table 1
Intensity features.

No.  Feature description               Dim.  Overall statistics  Total number
1    RMS amplitude                     1     4                   4
2    RMS number of low energy frames   1     4                   4


Table 2
Pitch features.

No.  Feature description                         Dim.  Overall statistics  Total number
3    Strongest frequency via autocorrelation     1     4                   4
4    Strongest frequency via zero crossings      1     4                   4
5    Strongest frequency via spectral centroid   1     4                   4
6    Strongest frequency via FFT maximum         1     4                   4
7    Inharmonicity rate                          1     1                   1

For estimating the local intensity of a frame, the root mean square (RMS) amplitude is used, which is computed as the root of the mean squared amplitude. In addition, the RMS number of low-energy frames is used to estimate the number of frames whose RMS amplitude is less than the average RMS amplitude in a texture window.
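The two intensity descriptors can be sketched as follows. Note that this is an assumption-laden sketch: the second function computes the common fraction-of-low-energy-frames variant, whereas the paper's exact aggregation of low-energy frames may differ; non-overlapping frames are also assumed for brevity.

```python
import numpy as np

def rms_per_frame(signal, frame_len=512):
    """Root-mean-square amplitude of each non-overlapping frame."""
    n = len(signal) // frame_len
    frames = np.asarray(signal[:n * frame_len], dtype=float).reshape(n, frame_len)
    return np.sqrt((frames ** 2).mean(axis=1))

def low_energy_rate(signal, frame_len=512):
    """Fraction of frames whose RMS is below the window's average RMS."""
    rms = rms_per_frame(signal, frame_len)
    return (rms < rms.mean()).mean()
```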
3.2. Pitch
Pitch reflects how high or low a sound is and is produced by the vibration rate of a sounding body such as a string. The sound that we hear comprises the fundamental frequency and its harmonic series; in other words, pitch is the fundamental frequency of the sound. Regarding human perception, higher vibration rates indicate higher frequencies and present brighter sounds, and vice versa. In this paper, we consider 17 pitch features (Table 2).
For estimating the fundamental frequency of a frame, several estimates of the strongest frequency are obtained using four different methods (i.e., autocorrelation, zero crossings, spectral centroid, and FFT maximum). In addition, the inharmonicity rate is an estimate of the percentage of frequencies that are not multiples of the fundamental frequency in a texture window.
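Of the four methods, the autocorrelation estimate can be sketched as follows. This is a textbook illustration under assumed defaults (a 50–2000 Hz pitch range), not the exact estimator used by the toolboxes.

```python
import numpy as np

def f0_autocorrelation(frame, sample_rate, fmin=50.0, fmax=2000.0):
    """Estimate the fundamental frequency (Hz) of one frame.

    Picks the lag with the strongest autocorrelation inside the
    plausible pitch range and converts that period to a frequency.
    """
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # autocorrelation for non-negative lags 0..N-1
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sample_rate / fmax)                      # shortest admissible period
    hi = min(int(sample_rate / fmin), len(ac) - 1)    # longest admissible period
    lag = lo + np.argmax(ac[lo:hi + 1])
    return sample_rate / lag
```

For a pure 220 Hz tone sampled at 22,050 Hz, the strongest admissible lag is about 100 samples, giving an estimate within a few hertz of the true pitch.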
3.3. Timbre
Sounds with the same intensity and pitch can be distinguished according to timbre, which gives such sounds their unique characteristics. Different sound bodies or instruments produce differing types of vibrations in varying ways, through the use of diverse materials, to form a unique harmonic series whose components are multiples of the fundamental frequency. Because each music genre features various instruments, each genre exhibits distinct timbre characteristics. Timbre is presented by different structures in the amplitude spectrum on individual or all frequency bands. Estimating the spectrum of audio signals can therefore derive timbre characteristics. We considered 200 timbre features (Table 3).
The zero crossing is the number of times the signal crosses the zero line. The spectral roll-off estimates the amount of high frequency by establishing a cutoff frequency below which a certain fraction of the total energy is contained. The spectral flux estimates the distance between the spectra of adjacent frames. The spectral centroid indicates the centroid frequency of the spectrum. Spectral variability is the variation degree of the neighboring peaks of the spectrum, and spectral spread shows the dispersion or spread range of the spectrum. Harmonic spectral smoothness calculates how smooth the spectrum is across its peaks. The audio spectrum envelope is a reduced spectrogram obtained by summing the energy within a series of log-scale frequency bands. Audio spectrum flatness reflects the flatness properties of the spectrum, determined by estimating the ratio between the geometric mean and the arithmetic mean within a series of frequency bands. The audio spectrum centroid shows the center of gravity of a log-frequency spectrum; the method of moments provides five statistical measures describing the spectral shape; and compactness is closely related to harmonic spectral smoothness but is estimated according to amplitudes. Finally, MFCC describes the spectral shape based on the mel-frequency scale.

Table 3
Timbre features.

No.  Feature description            Dim.  Overall statistics  Total number
8    Zero crossing                  1     4                   4
9    Spectral rolloff               1     4                   4
10   Spectral flux                  1     4                   4
11   Spectral centroid              1     4                   4
12   Spectral variability           1     4                   4
13   Spectral spread                1     4                   4
14   Harmonic spectral smoothness   1     4                   4
15   Audio spectrum envelope        24    2                   48
16   Audio spectrum flatness        22    2                   44
17   Audio spectrum centroid        1     4                   4
18   Method of moments              5     4                   20
19   Compactness                    1     4                   4
20   MFCC                           13    4                   52


Table 4
Tonality features.

No.  Feature description  Dim.  Overall statistics  Total number
21   Key strength         1     2                   2
22   Harmonic change      1     2                   2
23   Tonal centroid       6     2                   12

3.4. Tonality
Tonality is the arrangement of all of the tones and chords of a composition related to a tonic; for instance, E major is composed of E,
F#, G#, A, B, C#, and D#. Tonality consists of 24 major or minor diatonic scales. Audio signals can be transformed to a chromagram
containing all diatonic scales [20]. Tonality can be measured according to the center of the diatonic scales in the chromagram.
Table 4 presents 16 tonality features.
The key strength is an estimate of the tonal center positions and their respective clarity; the harmonic change is the estimate of the flux of the tonal centroid between adjacent frames; and the tonal centroid calculates the six-dimensional tonal centroid vector from the chromagram.

3.5. Rhythm
Rhythm is a time-related characteristic of sound, consisting of beats and tempos. A beat is related to notes and is commonly produced by striking or hitting actions; it is usually measured by the amplitude peaks. In music, multiple strong and weak beats usually form a regular period called a tempo. In general, a tempo is measured in beats per minute (BPM). Tempos can be used to distinguish music genres (e.g., high tempos are used in hip-hop and disco, and low tempos are used in classical and country music). Table 5 presents 24 rhythm features.
The beat spectrum is an estimate of acoustic self-similarity as a function of time lag; the peak strength is an estimate of the local maxima that are selected as the temporal duration of onset positions; and the attack time estimates the temporal duration from valleys to peaks. In addition, features such as the strongest beat, the beat sum, and the strength of the strongest beat present different beat values based on a beat histogram. We used the relative difference function to estimate significant changes in a signal relative to its signal level. In a texture window, the autocorrelation of a tempo is used to estimate the autocorrelation function of the onset detection curve, and the tempo in the spectrum is used to estimate the spectral decomposition of the onset detection curve.

4. Feature selection using the SAHS algorithm


The purpose of feature selection is to select the most relevant features that facilitate classification by using subset selection algorithms, which are categorized into three approaches: (a) wrapper, (b) filter, and (c) embedded algorithms [12]. Wrappers utilize the learning machine of interest as a black box to score subsets of variables according to their predictive power. Filters select subsets of variables as a pre-processing step, independently of the chosen predictor. Embedded methods perform variable selection in the process of training and are usually specific to given learning machines. Here, we used the filter approach to select the most relevant features because its computational loading is acceptable and the selected features generalize across classification algorithms. In our study, the feature-selection model consisted of two parts: (a) the SAHS algorithm [14] and (b) relative correlations (Fig. 2). Once an original feature set is provided, the SAHS algorithm iteratively searches for a better solution, which is then evaluated according to the relative correlations. The best solution is then output as the final feature subset.

Table 5
Rhythm features.

No.  Feature description                             Dim.  Overall statistics  Total number
24   Beat spectrum via self-similarity               1     2                   2
25   Beat strength                                   1     2                   2
26   Attack time                                     1     2                   2
27   Strongest beat via beat histogram               1     4                   4
28   Beat sum via beat histogram                     1     4                   4
29   Strength of strongest beat via beat histogram   1     4                   4
30   Relative difference function                    1     4                   4
31   Tempo in autocorrelation                        1     1                   1
32   Tempo in spectrum                               1     1                   1


Fig. 2. Feature selection model.

4.1. Harmony search algorithm


This subsection briefly reviews the HS algorithm and recently developed variants. We then introduce an HS variant (i.e., the SAHS) whose parameters are automatically adjusted. Accordingly, tuning control parameters is unnecessary, thereby leading to a nearly parameter-free HS algorithm.
4.1.1. General harmony search algorithm
The general procedures of an HS are as follows [10].
Step 1 Create and randomly initialize an HMS-size harmony memory (HM).
Step 2 Improvise a new harmony from the HM.
Step 3 Update the HM. If the new harmony is better than the worst harmony in the HM, include the new harmony in the HM, and
exclude the worst harmony from the HM.
Step 4 Repeat Steps 2 and 3 until the maximum number of iterations is reached.
An HM is a set of solution vectors, and it is convenient to view the HM as the population of a GA. The HS is governed by three rules in Step 2: (1) random selection, (2) memory consideration, and (3) pitch adjustment (Fig. 3). The function ran() shown in Figs. 3 and 4 is a random number generator producing a number in the range [0, 1].
The HS algorithm is a metaheuristic algorithm that imitates the improvisation process of musicians. First, an HMS-size HM is created and randomly initialized. To establish a better harmony (i.e., a better feature subset), each musician (i.e., a variable or feature) plays a note (i.e., a value in the range of the feature) to improvise a new harmony. In our feature-selection model, the possible value of a variable is either 1 or 0, representing whether or not the corresponding feature is selected. Once a new harmony is generated, its performance must be evaluated through an objective function; in this paper, we call this the relative correlation function. If the new harmony is better than the worst harmony in the HM, the worst harmony is replaced with the new harmony. New harmonies continue to be improvised until the maximum number of iterations is reached.
As shown in Fig. 3, the trial is the candidate pitch of a note played by a musician. The harmony memory consideration rate (HMCR), a value between 0 and 1, balances random selection against memory consideration, thereby trading off exploration and exploitation. If the HMCR is close to 0, the pitch is randomly selected from the entire possible range [LB, UB] of the variable; in other words, a random search is done over the range from the lower bound (LB) to the upper bound (UB).

Fig. 3. New harmony improvisation process.


Fig. 4. Pitch adjustment step for SAHS.

Conversely, the pitch tends to be selected from the HM because the HM preserves the best harmonies. In addition, the pitch adjustment rate (PAR) determines whether the pitch should be further adjusted according to a variable distance bandwidth (bw), which represents the range of a local search. In other words, the pitch adjustment step is similar to a local search mechanism with the variable distance bw as its step size, thereby indicating that the PAR and bw have a great influence on the quality of the final solutions.
4.1.2. Improved harmony search algorithm
Because the PAR and bw in the HS algorithm control the convergence rate and the ability for fine-tuning, Mahdavi et al. [23] proposed a variant of the HS, called the improved HS (IHS), which dynamically increases the PAR and decreases the bw, respectively. The IHS eliminates the weaknesses caused by fixing the PAR and bw in the HS algorithm.
The bw has a considerable influence on the precision of a solution and should be problem dependent. Therefore, decreasing the bw with the iteration number can fine-tune the final solution. This philosophy is identical to dynamically decreasing the learning rate of neural networks. However, determining how to select a suitable pair of bwmin and bwmax becomes a problem.
Nevertheless, continuously increasing the PAR in the HS might be a questionable strategy for several reasons. Because the PAR controls the probability of either selecting a pitch from the HM randomly or further adjusting a selected pitch, we believe that a successful search for the ideal harmony should progress from the beginning iteration and then gradually become more limited. Thus, the PAR should be decreased over time to prevent overshooting and oscillation. In addition, this questionable strategy apparently contradicts the aforementioned statements regarding the bw. Moreover, the global-best harmony search (GHS) [28], proposed by the same authors, obtained better results with a relatively small constant PAR instead of a variable one.
4.1.3. Global-best harmony search algorithm
A recently developed variant of the HS is the GHS [28], which borrows concepts from swarm intelligence to enhance performance. The GHS directly adopts the current best pitch from the HM to simplify pitch adjustment, thereby eliminating the difficulty of selecting the bw. Although this variant algorithm seems exceptional, it has several problems.
Because the HS belongs to the neighborhood search metaheuristics, it uses its own experiences to search for a new pitch. For this reason, neither swarms nor relative global concepts exist in the HS. The term "global-best" appears misused and confuses other researchers. In addition, although using the current best pitch from the HM eliminates the necessity of the bw, it causes a serious side effect: premature convergence. Moreover, several obvious mistakes exist in the GHS, which reduces the reliability of its numerical results.
4.1.4. Self-adaptive harmony search algorithm
Because the values of the PAR and bw have considerable influence on the final solutions, the SAHS algorithm attempts to improve the PAR and bw to provide better final solutions. The algorithm modifies the pitch adjustment step of the HS algorithm so that a new harmony is derived from its own experiences in the HM. More precisely, the SAHS algorithm completely replaces the parameter bw and updates the new harmony according to the minimal and maximal values in the HM. As shown in Fig. 4, the pitch adjustment step adjusts trial2, which is the pitch of a variable selected from the HM. Here, min(HM) and max(HM) are the minimal and maximal values of that variable in the HM. Because this mechanism can progressively make finer adjustments to the harmony, the new harmony gradually approaches the optimum. Furthermore, the pitch adjustment using this mechanism does not violate the boundary constraints of the variables, unlike an adjustment based on the bw.
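One common formulation of this bandwidth-free update can be sketched as follows. The even split between nudging toward max(HM) and toward min(HM) is an assumption of this sketch of Fig. 4, not a detail stated in the text.

```python
import random

def sahs_pitch_adjust(trial, hm_column, rng=random.random):
    """Self-adaptive pitch adjustment (a sketch of Fig. 4).

    Instead of a fixed bandwidth bw, the selected pitch `trial` is
    nudged toward the current minimum or maximum of the same variable
    in the harmony memory (`hm_column`), so adjustments shrink as the
    memory converges and never leave the variable's observed bounds.
    """
    lo, hi = min(hm_column), max(hm_column)
    if rng() < 0.5:
        return trial + rng() * (hi - trial)   # move toward max(HM)
    return trial - rng() * (trial - lo)       # move toward min(HM)
```

Because the adjusted value always lies between min(HM) and max(HM), no boundary constraint can be violated, which is the property claimed above.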
4.2. Relative correlations
When each new harmony, which represents a selected feature subset, is generated by the SAHS algorithm, the relative correlations are used to evaluate the performance of the selected feature subset. The correlations of the selected feature subset are computed in two phases: (1) the intra-correlation is used to evaluate the mutual correlation between features within the subset, and (2) the inter-correlation is used to compare each feature inside the subset with the corresponding class. If a subset demonstrates higher


Table 6
Confusion matrix on the original feature set (only the diagonal entries and the per-genre recall and precision are recoverable here; the off-diagonal structure is omitted).

Genre      Correct  Recall  Precision
Blues      91       91%     90.1%
Classical  92       92%     94.8%
Country    79       79%     77.5%
Disco      76       76%     80%
Hiphop     89       89%     85.6%
Jazz       89       89%     92.7%
Metal      90       90%     93.8%
Pop        88       88%     80%
Reggae     76       76%     82.6%
Rock       73       73%     68.2%
Average             84.3%   84.5%

performance, then it must possess the property of lower intra-correlation and higher inter-correlation. Lower intra-correlation indicates that the features within the subset are not mutually redundant, whereas higher inter-correlation signifies that each feature within the subset is discriminative for the corresponding class.
In this study, we adopted a measure called mutual information [33], which evaluates the degree of mutual dependence between two variables. Its definition for discrete random variables is as follows:

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) · log( p(x, y) / (p1(x) p2(y)) )

where p(x, y) is the joint probability distribution function of X and Y, and p1(x) and p2(y) are the marginal probability distribution functions of X and Y, respectively. The value of I(X; Y) is non-negative; if I(X; Y) is zero, then X and Y are independent; otherwise, a high value indicates a strong dependence between X and Y.
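The definition above can be estimated from empirical counts as follows; this is a minimal plug-in estimator, not the exact implementation used in the study.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) for two paired discrete samples, estimated from the
    empirical joint and marginal probabilities (natural log)."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))            # joint counts
    px, py = Counter(xs), Counter(ys)     # marginal counts
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint / (p1(x) * p2(y)) written with counts: c*n / (cx*cy)
        mi += p_joint * math.log(p_joint * n * n / (px[x] * py[y]))
    return mi
```

For identical binary variables the estimate equals the entropy log 2, and for independent variables it is zero, matching the properties stated above.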
Table 7
Confusion matrix by the global selection strategy (only the diagonal entries and the per-genre recall and precision are recoverable here; the off-diagonal structure is omitted).

Genre      Correct  Recall  Precision
Blues      95       95%     92.2%
Classical  94       94%     96.9%
Country    87       87%     79.8%
Disco      86       86%     86.9%
Hiphop     91       91%     92.9%
Jazz       91       91%     93.8%
Metal      94       94%     96.9%
Pop        96       96%     89.7%
Reggae     84       84%     91.3%
Rock       77       77%     76.2%
Average             89.5%   89.7%


Table 8
Number of features in each local feature set.

Genre      Blues  Classical  Country  Disco  Hiphop  Jazz  Metal  Pop  Reggae  Rock
Blues      *      85         80       89     85      75    79     80   77      104
Classical         *          87       90     77      94    87     78   86      81
Country                      *        88     79      77    85     87   76      89
Disco                                 *      87      84    74     81   88      86
Hiphop                                       *       87    73     77   78      94
Jazz                                         *       82    82     81   90
Metal                                                      *      76   83      88
Pop                                                               *    86      78
Reggae                                                                 *       77
Rock                                                                           *

The intra-correlation within the feature subset is shown as follows:

R_I(S) = \frac{1}{C(|S|, 2)} \sum_{i=1}^{|S|} \sum_{j=i+1}^{|S|} I(x_i, x_j)

where C(|S|, 2) is the number of 2-combinations of feature subset S. The overall correlation within subset S is divided by C(|S|, 2) to obtain the average correlation between features within subset S. The inter-correlation between the feature subset and the corresponding class is shown as follows:

R_T(S, y) = \frac{1}{|S|} \sum_{i=1}^{|S|} I(x_i, y)

where |S| is the cardinality of feature subset S, and y is the output class. The overall correlation is divided by the cardinality to derive the average correlation between the features and the corresponding class.
Finally, the relative overall correlation combining both the intra-correlation and the inter-correlation [13] is shown as follows:

RC(S, y) = \frac{k \cdot R_T(S, y)}{\sqrt{k + k(k-1) \cdot R_I(S)}}

where k is the cardinality of feature subset S. A higher RC value indicates a better feature subset (i.e., one with lower intra-correlation and higher inter-correlation), and the denominator balances the effect of the average correlations across candidate feature subsets of different cardinalities. Consequently, RC was chosen as the objective function in the feature-selection model.
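Under the definitions above, the filter objective can be sketched as follows. This is a minimal illustration, assuming the pairwise and feature-class mutual-information values are precomputed; it does not reproduce the paper's SAHS integration:

```python
from itertools import combinations
from math import comb, sqrt

def relative_correlation(mi_ff, mi_fy):
    """Relative overall correlation RC(S, y) of a feature subset.

    mi_ff: dict mapping index pairs (i, j), i < j, to I(x_i, x_j)
    mi_fy: list with I(x_i, y) for each feature x_i in the subset
    """
    k = len(mi_fy)
    rt = sum(mi_fy) / k                                  # R_T(S, y)
    if k > 1:
        # R_I(S): average pairwise feature-feature mutual information
        ri = sum(mi_ff[p] for p in combinations(range(k), 2)) / comb(k, 2)
    else:
        ri = 0.0
    return k * rt / sqrt(k + k * (k - 1) * ri)           # RC(S, y)
```

A candidate subset with higher RC is preferred, since the numerator rewards feature-class relevance while the denominator penalizes redundancy among the selected features.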

Table 9
Confusion matrix by the local selection strategy (diagonal counts with per-genre recall and precision).

Genre       Correct   Recall   Precision
Blues          99      99%      90.8%
Classical     100     100%      96.2%
Country        87      87%      84.5%
Disco          86      86%      91.5%
Hiphop         94      94%      93.1%
Jazz           94      94%      98.9%
Metal          97      97%      97.9%
Pop            97      97%      92.4%
Reggae         85      85%      92.4%
Rock           83      83%      84.7%

Overall accuracy: 92.2%.


5. Experimental results

In this study, we conducted experiments on the GTZAN dataset [35], which is public and has been extensively employed in studies on music classification. The dataset consists of 10 music genres: blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock. Each genre consists of 100 thirty-second audio recordings with a sample rate of 22,050 Hz; we extracted frames (or analysis windows) of 512 samples (approximately 23 ms), with 128 samples overlapping between adjacent frames. Here, to be fair in comparing our methods with other approaches, we follow the baseline of not using an artist filter [9] before

[Three bar-chart panels, with values from 0 to 1, plot the proportions of intensity, timbre, pitch, tonality, and rhythm features selected for each of the 45 genre pairs (blu. vs. cla., blu. vs. cou., ..., reg. vs. roc.).]
Fig. 5. Feature distribution in each local feature set using the filter approach.


Table 10
Number of features in each local feature set using the wrapper approach.

Genre       Blu   Cla   Cou   Dis   Hip   Jaz   Met   Pop   Reg   Roc
Blues         *
Classical   151     *
Country     141   158     *
Disco       136   132   131     *
Hiphop      154   162   150   148     *
Jazz        134   136   150   162   149     *
Metal       157   154   157   154   145   147     *
Pop         128   161   150   133   154   133   160     *
Reggae      146   152   142   143   138   144   154   129     *
Rock        135   148   148   159   125   145   160   143   136     *

* The asterisks indicate the diagonal of the matrix.

conducting music genre classification. In the SAHS algorithm, the parameters HMS and HMCR are set to 50 and 0.99, respectively, which yielded the best performance. Next, we adopted LIBSVM, developed by Chang and Lin [6], as the SVM classifier. We used the radial basis function (RBF) kernel because it is more accurate and effective than the other kernel functions. The parameters γ and C are determined according to the best performance over the 6 × 6 combinations of γ ∈ [2^−4, …, 2^1] and C ∈ [2^−2, …, 2^3]. Moreover, each feature is normalized into the range [−1, 1]. Following previous studies, all of the presented classification results were evaluated using ten-fold cross-validation. Regarding the classification results, a confusion matrix was employed to present all statistics on correct and false predictions after cross-validation. Precision, recall, and accuracy were estimated as performance measures and are shown as follows:

Precision = \frac{N_C}{N_C + N_F}

Recall = \frac{N_C}{N_C + N_M}

Accuracy = \frac{total_c}{total_m}

where N_C is the number of accurately predicted music tracks, N_F is the number of falsely predicted music tracks, N_M is the number of missed music tracks, total_c is the number of all accurately predicted music tracks, and total_m is the number of all music tracks. The precision and recall values are reported together with each confusion matrix.
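These three measures can be read directly off a confusion matrix. A small sketch with NumPy, assuming rows are actual classes and columns are predictions:

```python
import numpy as np

def confusion_metrics(cm):
    """Per-class precision and recall plus overall accuracy from a confusion
    matrix cm, where cm[i, j] counts recordings of actual class i that were
    predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    correct = np.diag(cm)                    # N_C per class
    precision = correct / cm.sum(axis=0)     # N_C / (N_C + N_F), per column
    recall = correct / cm.sum(axis=1)        # N_C / (N_C + N_M), per row
    accuracy = correct.sum() / cm.sum()      # total_c / total_m
    return precision, recall, accuracy
```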
In this section, we first present the classification results of the original feature set. We then use feature selection (global and local selection) to present the improvements in the classification results. Finally, a comparison between the results of our method and those of other studies is presented. Besides, we also conducted the experiments with the artist filter to investigate their accuracy.
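The kernel-parameter tuning described above can be sketched as follows. This illustration uses scikit-learn's SVC (which wraps LIBSVM) and random placeholder data standing in for the actual GTZAN feature vectors:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Placeholder data standing in for the extracted feature vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

pipe = make_pipeline(
    MinMaxScaler(feature_range=(-1, 1)),   # normalize features to [-1, 1]
    SVC(kernel="rbf"),                     # RBF-kernel SVM (LIBSVM-based)
)
param_grid = {
    "svc__gamma": [2.0 ** e for e in range(-4, 2)],  # gamma in 2^-4 .. 2^1
    "svc__C": [2.0 ** e for e in range(-2, 4)],      # C in 2^-2 .. 2^3
}
search = GridSearchCV(pipe, param_grid, cv=10)       # ten-fold cross-validation
search.fit(X, y)
```

Placing the scaler inside the pipeline ensures that the normalization is refit on each training fold, so no information from the test fold leaks into the tuning.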
Table 11
Confusion matrix by the local selection strategy using the wrapper approach (diagonal counts with per-genre recall and precision).

Genre       Correct   Recall   Precision
Blues         100     100%      99%
Classical     100     100%      98%
Country        97      97%      97.1%
Disco          95      95%      98%
Hiphop         97      97%      94.2%
Jazz           99      99%      95.9%
Metal          99      99%      97.9%
Pop           100     100%      96.2%
Reggae         95      95%      95%
Rock           90      90%      95.7%

Overall accuracy: 97.2% (average precision 96.7%).


[Three bar-chart panels, with values from 0 to 0.9, plot the proportions of intensity, timbre, pitch, tonality, and rhythm features selected for each of the 45 genre pairs (blu. vs. cla., blu. vs. cou., ..., reg. vs. roc.).]
Fig. 6. Feature distribution in each local feature set using the wrapper approach.

5.1. Classification results of the original feature set

In this experiment, the original feature set of 265 features was used as the feature vector. The classification results are presented in Table 6, and the overall classification accuracy is approximately 84.3%. This result demonstrates that the original feature set is practical for music classification but unsuitable for some genres, such as country, disco, reggae, and rock, because the precision or


recall is less than 80%. Lower precision or recall indicates that several unsuitable features in the original feature set disturb the classifier and prevent it from discriminating these genres clearly.
5.2. Classification results by feature selection

Three classification results obtained using feature selection are presented in the following sections. First, the feature set selected by the global-selection strategy is presented for the 10 genres. Next, the feature sets selected by the local-selection strategy using the filter approach are presented for each pair of genres. Finally, the feature sets selected by the local-selection strategy using the wrapper approach are presented for each pair of genres.
5.2.1. Global selection strategy
In this experiment, a global feature set for the 10 genres was selected, consisting of 114 features (i.e., 3 intensity, 5 pitch, 84 timbre, 10 tonality, and 12 rhythm features). For the classification results presented in Table 7, the overall accuracy is approximately 89.5%, an improvement of 5.2% over the results presented in Table 6. All precision and recall results showed improvement, especially for the disco and pop genres. These results demonstrate that the feature-selection model effectively removed features unsuitable for classification, and that only 43% of the original features were necessary to obtain higher performance.
5.2.2. Local selection strategy using the filter approach
In this experiment, 45 one-against-one local feature sets were generated for classifying the 10 genres, and the number of features in each set is presented in Table 8. This table is symmetric, and the numbers of features are considerably lower than that of the global feature set. Table 9 shows the classification results. The overall accuracy is approximately 92.2%, a 2.7% increase over the results presented in Table 7. These results demonstrate that, without considering other genres, more precise features can be selected for each pair of genres, yielding better classification results than those obtained using the global feature set. We also analyzed the features most frequently selected in each local feature set by the filter approach (Fig. 5). We found that rhythm features occupied more than 70% of the selected features and thus played a primary role in discriminating music genres.
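The pairwise classification scheme with majority voting can be sketched as follows. This is a minimal illustration (not the authors' code), assuming the per-pair feature subsets have already been produced by the selection step:

```python
from collections import Counter

import numpy as np
from sklearn.svm import SVC

class PairwiseLocalOVO:
    """One-against-one SVM classification in which every genre pair (a, b)
    uses its own local feature subset; the final label is chosen by
    majority voting over all pairwise predictions."""

    def __init__(self, local_sets):
        # local_sets maps a pair (a, b) to the list of selected column
        # indices for that pair (assumed precomputed, e.g. by SAHS).
        self.local_sets = local_sets
        self.models = {}

    def fit(self, X, y):
        for (a, b), cols in self.local_sets.items():
            mask = np.isin(y, [a, b])          # keep only the two genres
            clf = SVC(kernel="rbf")
            clf.fit(X[np.ix_(mask, cols)], y[mask])
            self.models[(a, b)] = clf
        return self

    def predict(self, X):
        labels = []
        for x in X:
            votes = Counter()
            for pair, clf in self.models.items():
                cols = self.local_sets[pair]
                votes[clf.predict(x[cols].reshape(1, -1))[0]] += 1
            labels.append(votes.most_common(1)[0][0])
        return np.array(labels)
```

With 10 genres this yields 45 pairwise classifiers, each seeing only the feature columns chosen for its own genre pair.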
5.2.3. Local selection strategy using the wrapper approach
As mentioned in Section 4, the filter approach was used in our feature-selection model, with the relative correlation as the objective function. To apply the local-selection strategy, a second approach, the wrapper approach, was also adopted, in which the classification error rate serves as the objective function. Because a specific classifier interacts directly with its selected feature subset, the wrapper approach obtains a better solution than the filter approach does. In this study, the classification results based on the wrapper approach were regarded as a benchmark for reference.

Table 12
Comparisons among all methods.

Method                     Dataset  Feature characteristics                     Feature dim.  Dimensionality reduction                         Remaining dim.  Classifier  Accuracy
Y. Panagakis et al. [31]   GTZAN    Timbre, pitch, temporal                     768           NMPCA                                            58%             SVMs        84.3%
C. Kotropoulos et al. [18] GTZAN    Timbre, pitch, temporal                     768           LDA                                              25%             LDA         84.96%
C. Lee et al. [22]         GTZAN    Timbre                                      –             LDA                                              –               LDA         90.6%
A.F. Arabi et al. [1]      GTZAN    Timbre, beat, chord                         –             –                                                –               SVMs        90.79%
Y. Panagakis et al. [29]   GTZAN    Timbre, pitch, temporal                     768           NMF                                              25%             SRC         91%
Y. Panagakis et al. [30]   GTZAN    Timbre, pitch, temporal                     7680          TPNTF                                            1.8%            SRC         93.7%
Ours                       GTZAN    Intensity, pitch, timbre, tonality, rhythm  265           Feature selection using SAHS (filter approach)   31%             SVMs        92.2%
Ours                       GTZAN    Intensity, pitch, timbre, tonality, rhythm  265           Feature selection using SAHS (wrapper approach)  55%             SVMs        97.2%

NMPCA: non-negative multilinear principal component analysis.
LDA: linear discriminant analysis.
NMF: non-negative matrix factorization.
TPNTF: topology preserving non-negative tensor factorization.
SRC: sparse representation-based classifier.


Table 13
Number of features using the filter approach (with the artist filter).

Genre       Blu   Cla   Cou   Dis   Hip   Jaz   Met   Pop   Reg   Roc
Blues         *
Classical   104     *
Country      98    94     *
Disco       112   106    99     *
Hiphop       94    91    95    99     *
Jazz         99    99    87   112    92     *
Metal        98   104    97    93    88    99     *
Pop          82    95    88   102    98   116   106     *
Reggae       91    96    92    96   103   103    91   101     *
Rock        112   104   104    96   104   104    95    75    96     *

* The asterisks indicate the diagonal of the matrix.

In this experiment, 45 local feature sets were generated for classification, and the number of features in each set is presented in Table 10. However, the numbers of features are considerably larger than those obtained using the filter approach. This indicates that a slight gap remains between formula evaluation (i.e., the relative correlation used in the filter approach) and real classification (i.e., the classification error rate used in the wrapper approach). Regarding the classification results presented in Table 11, the overall accuracy is approximately 97.2%, an increase of 5% over the results presented in Table 9. For a specific classifier, the wrapper approach demonstrates higher performance than the filter approach; however, it is impractical as a general feature-selection model because of the considerable computation time required for feature selection. We also analyzed the features most frequently selected in each local feature set by the wrapper approach (Fig. 6). The situation was similar to that of the filter approach: rhythm played the primary role in discriminating music genres.
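The selection loop underlying both strategies can be sketched as a simplified binary harmony search. This illustration omits the self-adaptive mechanisms of SAHS [14] and treats the objective function as a black box:

```python
import numpy as np

def harmony_search_select(evaluate, n_features, hms=10, hmcr=0.99,
                          iters=200, seed=0):
    """Simplified binary harmony search for feature selection (a sketch;
    the paper's SAHS additionally self-adapts its parameters).

    evaluate(mask) -> fitness to maximize, e.g. cross-validated accuracy
    (wrapper approach) or the relative correlation RC (filter approach).
    """
    rng = np.random.default_rng(seed)
    # Harmony memory: hms random feature masks and their fitness values.
    memory = [rng.random(n_features) < 0.5 for _ in range(hms)]
    fitness = [evaluate(m) for m in memory]
    for _ in range(iters):
        new = np.empty(n_features, dtype=bool)
        for j in range(n_features):
            if rng.random() < hmcr:
                # Memory consideration: copy bit j from a random harmony.
                new[j] = memory[rng.integers(hms)][j]
            else:
                # Random consideration: draw bit j uniformly.
                new[j] = rng.random() < 0.5
        f = evaluate(new)
        worst = int(np.argmin(fitness))
        if f > fitness[worst]:
            memory[worst], fitness[worst] = new, f   # replace worst harmony
    best = int(np.argmax(fitness))
    return memory[best], fitness[best]
```

For the wrapper variant, `evaluate` would train the pairwise SVM on the masked feature columns and return the cross-validated accuracy; for the filter variant, it would return the RC merit of the masked subset.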

5.3. Comparisons among all methods

Table 12 presents qualitative comparisons between our methods and other novel methods. Various dimensionality-reduction techniques, such as NMPCA [31], LDA [15,18,22], NMF [21], and TPNTF [30], extract features from varying feature characteristics, and various classifiers were used in these studies for music classification. The features from the modulation spectral analyses of MFCC, OSC, and NASE, as well as the conventional bag-of-frames features from low and high levels, also present strong performance. Thus, we

Table 14
Confusion matrix using the filter approach (with the artist filter; diagonal counts with per-genre recall and precision).

Genre       Correct   Recall   Precision
Blues          91      91%      87.5%
Classical      93      95.9%    94.9%
Country        70      76.9%    82.4%
Disco          66      78.6%    93%
Hiphop         76      89.4%    97.4%
Jazz           72      87.8%    92.3%
Metal          64      91.4%    94.1%
Pop            72      93.5%    93.5%
Reggae         66      82.5%    78.6%
Rock           82      84.5%    68.3%

Overall accuracy: 87.1% (average recall 87.2%, average precision 88.2%).


Table 15
Number of features using the wrapper approach (with the artist filter).

Genre       Blu   Cla   Cou   Dis   Hip   Jaz   Met   Pop   Reg   Roc
Blues         *
Classical   153     *
Country     146   128     *
Disco       150   156   146     *
Hiphop      152   160   135   158     *
Jazz        152   146   145   164   133     *
Metal       145   163   146   136   153   146     *
Pop         159   149   160   142   160   141   153     *
Reggae      150   145   131   150   155   137   166   144     *
Rock        142   135   148   148   152   157   163   154   161     *

* The asterisks indicate the diagonal of the matrix.

employed the bag-of-frames features in our experiments and combined them with a strong feature-selection model based on a local-selection strategy, thereby obtaining the first and third highest performance levels among all methods.
5.4. Comparisons between without/with the Artist Filter (AF)
We also conducted the experiments with the artist filter. In these experiments, we referred to the analysis of the GTZAN dataset [34] to filter out inappropriate audio recordings, after which 863 audio recordings remained. The numbers of features and the confusion matrix using the filter approach are presented in Tables 13 and 14, whereas the numbers of features and the confusion matrix using the wrapper approach are shown in Tables 15 and 16. To compare the accuracy of the methods without/with AF, we present their experimental results in Table 17. Except for the methods using the original feature set, the accuracy of the methods with AF is lower than that without AF. This verifies the statement "Using songs from the same artist in both training and test sets leads to overoptimistic accuracy results" concluded by Flexer [9], although a different dataset (i.e., ISMIR 2004) was used there. Besides, we also find that the number of selected features used in the methods with AF is greater than that without AF. This could be because more features are required to identify the genres of the test samples when only 863 audio recordings are available. However, there is no significant difference for the wrapper approach, possibly because feature selection with the wrapper approach can find truly optimal local feature sets. Finally, as expected, the ranking of accuracy from low to high is still the original feature set, the global selection, the filter approach, and the wrapper approach.
6. Conclusion
This paper proposed an effective framework for music genre classification. The original feature set extracted from intensity, pitch, timbre, tonality, and rhythm is practical for music classification. By applying the SAHS algorithm to the original feature set, the feature-

Table 16
Confusion matrix using the wrapper approach (with the artist filter; diagonal counts with per-genre recall and precision).

Genre       Correct   Recall   Precision
Blues          95      95%      95%
Classical      95      97.9%    96.9%
Country        82      90.1%    89.1%
Disco          75      89.3%    91.5%
Hiphop         72      84.7%    96%
Jazz           77      93.9%    96.3%
Metal          64      91.4%    95.5%
Pop            72      93.5%    92.3%
Reggae         72      90%      85.7%
Rock           82      84.5%    76.6%

Overall accuracy: 91.1% (average recall 91%, average precision 91.5%).


Table 17
Comparisons between without/with AF.

                                Original feature   Global selection   Filter approach   Wrapper approach
Without AF
  Percentage of remaining dim.  100%               43%                31%               55%
  Accuracy                      84.3%              89.5%              92.2%             97.2%
With AF
  Percentage of remaining dim.  100%               50%                37%               56%
  Accuracy                      86.3%              86.4%              87.1%             91.1%

selection model effectively located the optimal feature subsets for the corresponding music genres. Regarding the experimental results obtained using the SVM classifier, the local-selection strategy with the wrapper or filter approach presented higher performance than the global-selection strategy. In other words, the classification accuracy gradually improves across the different strategies: 1) the original feature set (84.3%), 2) the global selection (89.5%), 3) the local selection using the filter approach (92.2%), and 4) the local selection using the wrapper approach (97.2%). The same ranking also holds when the artist filter is used. Thus, the local-selection models indeed derived more relevant features for music genre classification. In summary, the experimental results demonstrated that our method is more effective than other relevant methods.
Besides, the local feature selection using the SAHS algorithm could also be applied to various applications such as movie genre classification and painting genre classification. On the other hand, we are curious whether the local feature selection still performs better than the global feature selection, or than no feature selection, even when a to-be-classified dataset contains only a few genres.
References
[1] A.F. Arabi, G.J. Lu, Enhanced polyphonic music genre classification using high level features, Proc. the 1st IEEE International Conference on Signal and Image Processing Applications, 2009, pp. 101–106.
[2] J.J. Aucouturier, F. Pachet, Representing musical genre: a state of the art, J. New Music Res. 32 (1) (2003) 83–93.
[3] E. Benetos, C. Kotropoulos, Non-negative tensor factorization applied to music genre classification, IEEE Trans. Audio Speech Lang. Process. 18 (8) (2010) 1955–1967.
[4] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, Proc. the 5th Annual ACM Workshop on Computational Learning Theory, 1992, pp. 144–152.
[5] S.F. Chang, T. Sikora, A. Puri, Overview of the MPEG-7 standard, IEEE Trans. Circ. Syst. Video Technol. 11 (6) (2001) 688–695.
[6] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[7] M. Deriche, Feature selection using ant colony optimization, Proc. the 6th International Multi-conference on Systems, Signals, and Devices, 2009, pp. 1–4.
[8] R. Diao, Q. Shen, Two new approaches to feature selection with harmony search, Proc. IEEE Int. Conf. Fuzzy Syst. (2010) 1–7.
[9] A. Flexer, A closer look on artist filters for musical genre classification, Proc. the 10th International Society for Music Information Retrieval Conference, 2007.
[10] Z.W. Geem, J.H. Kim, G.V. Loganathan, A new heuristic optimization algorithm: harmony search, Simulation 76 (2) (2001) 60–68.
[11] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (1–3) (2002) 389–422.
[12] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[13] M.A. Hall, L.A. Smith, Practical feature subset selection for machine learning, Proc. the 21st Australian Computer Science Conference, 1998, pp. 181–191.
[14] Y.F. Huang, C.M. Wang, Self-adaptive harmony search algorithm for optimization, Expert Syst. Appl. 37 (4) (2010) 2826–2837.
[15] D.W. Jang, M.H. Jin, C.D. Yoo, Music genre classification using novel features and a weighted voting method, Proc. IEEE Int. Conf. Multimed. Expo (2008) 1377–1380.
[16] X. Jin, R. Bie, Random forest and PCA for self-organizing maps based automatic music genre discrimination, Proc. Int. Conf. Data Min. (2006) 414–417.
[17] H.G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[18] C. Kotropoulos, G.R. Arce, Y. Panagakis, Ensemble discriminant sparse projections applied to music genre classification, Proc. the 20th International Conference on Pattern Recognition, 2010, pp. 822–825.
[19] P.L. Lanzi, Fast feature selection with genetic algorithms: a filter approach, Proc. Int. Conf. Evol. Comput. (1997) 537–540.
[20] O. Lartillot, P. Toiviainen, A Matlab toolbox for musical feature extraction from audio, Proc. the 10th International Conference on Digital Audio Effects, 2007, pp. 237–244.
[21] D.D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems, vol. 13, MIT Press, 2001, pp. 556–562.
[22] C.H. Lee, J.L. Shih, K.M. Yu, H.S. Lin, Automatic music genre classification based on modulation spectral analysis of spectral and cepstral features, IEEE Trans. Multimed. 11 (4) (2009) 670–682.
[23] M. Mahdavi, M. Fesanghary, E. Damangir, An improved harmony search algorithm for solving optimization problems, Appl. Math. Comput. 188 (2) (2007) 1567–1579.
[24] J.M. Martinez, R. Koenen, F. Pereira, MPEG-7: the generic multimedia content description standard, part 1, IEEE Multimed. 9 (2) (2002) 78–87.
[25] J.M. Martinez, MPEG-7 overview (version 10), ISO/IEC JTC1/SC29/WG11 N6828, 2004.
[26] D. McEnnis, C. McKay, I. Fujinaga, P. Depalle, jAudio: a feature extraction library, Proc. the 6th International Conference on Music Information Retrieval, 2005, pp. 600–603.
[27] V. Mitra, C.J. Wang, A neural network based audio content classification, Proc. Int. Joint Conf. Neural Netw. (2007) 1494–1499.
[28] M.G. Omran, M. Mahdavi, Global-best harmony search, Appl. Math. Comput. 198 (2) (2008) 643–656.
[29] Y. Panagakis, C. Kotropoulos, G.R. Arce, Music genre classification via sparse representations of auditory temporal modulations, Proc. the 17th European Signal Processing Conference, 2009, pp. 1–5.
[30] Y. Panagakis, C. Kotropoulos, Music genre classification via topology preserving non-negative tensor factorization and sparse representations, Proc. the 35th IEEE International Conference on Acoustics, Speech, and Signal Processing, 2010, pp. 249–252.
[31] Y. Panagakis, C. Kotropoulos, G.R. Arce, Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification, IEEE Trans. Audio Speech Lang. Process. 18 (3) (2010) 576–588.
[32] C.N. Silla, A.L. Koerich, C. Kaestner, Feature selection in automatic music genre classification, Proc. the 10th IEEE International Symposium on Multimedia, 2008, pp. 39–44.
[33] S. Guiasu, Information Theory with Applications, McGraw-Hill, 1977.


[34] B.L. Sturm, An analysis of the GTZAN music genre dataset, Proc. the 2nd International ACM Workshop on Music Information Retrieval with User-centered and Multimodal Strategies, 2012, pp. 7–12.
[35] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. Speech Audio Process. 10 (5) (2002) 293–302.
[36] L. Wang, R. Yang, Y. Xu, Q. Niu, P.M. Pardalos, M. Fei, An improved adaptive binary harmony search algorithm, Inf. Sci. 232 (2013) 58–87.
[37] X. Yang, K. Wang, S.A. Shamma, Auditory representations of acoustic signals, IEEE Trans. Inf. Theory 38 (2) (1992) 824–839.

Yin-Fu Huang received the B.S. degree in computer science from National Chiao-Tung University in 1979, and the M.S. and Ph.D. degrees in
computer science from National Tsing-Hua University in 1984 and 1988, respectively. He is currently a Professor in the Department of
Computer Science and Information Engineering, National Yunlin University of Science and Technology. Between July 1988 and July
1992, he was with Chung Shan Institute of Science and Technology as an Assistant Researcher. His research interests include database systems, multimedia systems, data mining, mobile computing, and bioinformatics.

Sheng-Min Lin received his B.S. degree in computer science from National Kaohsiung University of Applied Sciences in 2010. He is currently a master's student in the Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology. His major areas of interest are multimedia systems and data mining.

Huan-Yu Wu received his B.S. degree in information and communication engineering from Chaoyang University of Technology in 2013. He is currently a master's student in the Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology. His major areas of interest are cluster computing, multimedia systems, and data mining.

Yu-Siou Li received his B.S. degree in computer science from National Formosa University and his M.S. degree in computer science from National Yunlin University of Science and Technology in 2008 and 2011, respectively. He is currently serving at Qisda Co., Ltd. His major areas of interest are database systems, multimedia systems, and data mining.
