Article info

Article history: Received 3 April 2012; Available online 2 November 2012. Communicated by S. Sarkar.

Keywords: Multi-label feature selection; Multivariate feature selection; Multivariate mutual information; Label dependency

Abstract

Recently, classification tasks that naturally emerge in multi-label domains, such as text categorization, automatic scene annotation, and gene function prediction, have attracted great interest. As in traditional single-label classification, feature selection plays an important role in multi-label classification. However, recent feature selection methods require preprocessing steps that transform the label set into a single label, resulting in subsequent additional problems. In this paper, we propose a feature selection method for multi-label classification that naturally derives from the mutual information between the selected features and the label set. The proposed method was applied to several multi-label classification problems and compared with conventional methods. The experimental results demonstrate that the proposed method improves classification performance to a great extent and is a useful method for selecting features in multi-label classification problems.

© 2012 Elsevier B.V. All rights reserved.
doi:10.1016/j.patrec.2012.10.005
350 J. Lee, D.-W. Kim / Pattern Recognition Letters 34 (2013) 349–357
detailed description of conventional feature selection methods that require a transformation method. In Section 3, we propose our multi-label feature filter criterion. To achieve this, we decompose the calculation of high-dimensional entropy into a cumulative sum of multivariate mutual information terms. The performance of the proposed method is investigated with several evaluation measures on various multi-label data sets in Section 4. The discussion and conclusions are presented in Section 5.

2. Related work

Before reviewing conventional feature selection methods, we introduce some basic notation for multi-label learning. Let W ⊂ R^d denote an input space constructed from d features; patterns drawn from W are assigned to a label subset λ ⊆ L, where L = {l_1, ..., l_t} is a finite set of labels with |L| = t. Thus, multi-label classification is the task of assigning unseen patterns to multiple labels. To solve the multi-label classification problem, an algorithm should take many labels into account concurrently. Popular multi-label learning algorithms first transform label sets into a single label (a process called problem transformation) and then solve the resultant problem (Tsoumakas and Katakis, 2007; Yang and Pedersen, 1997; Doquire and Verleysen, 2011). Similarly, feature selection in the problem transformation approach proceeds as follows: (1) transform the original multi-label data set into a single-label data set; (2) assess each feature independently using a score measure such as mutual information (MI) or the χ² statistic; (3) select the predefined top n features as input features for the multi-label classifier. We denote this process as problem transformation + score measure, a notation that will be used subsequently.

Chen et al. (2007) proposed an Entropy-based Label Assignment (ELA) that assigns weights to a multi-label pattern for its different labels based on the label entropy. The ELA copies each pattern in accordance with the number of its labels, and the inverse of the number of its labels is then assigned as the weight of each copy. Thus, each original pattern-labels pair (P_i, λ) is transformed into a set of patterns T_i = {(P_{i1}, l_1), ..., (P_{i|λ|}, l_{|λ|})}, each with weight 1/|λ|, where 1 ≤ i ≤ |W| and l_j ∈ λ. Since patterns with too many labels are blurred out of the training phase by the low weights assigned to them, they argued that the learning algorithm can avoid the overfitting problem originating from such patterns. Text categorization data sets were transformed by ELA, and three feature selection methods were then exhaustively applied to each transformed data set; two employed information gain and the χ² statistic as their score measures, and the third used an optimal orthogonal centroid feature selection method. Their empirical experiments indicate that any problem transformation method that loses information about the dependency among labels may lead to poor classification performance, even though the classification performance was improved by the feature selection methods.

The Label Powerset (LP) has been applied to music information retrieval, specifically for recognizing six emotions that are simultaneously evoked by a music clip (Trohidis et al., 2008). It transforms a multi-label to a single label by assigning each pattern's label set to a single class, so each pattern-labels pair (P_i, λ) is transformed into (P_i, c_i), where c_i ∈ {0, 1}^t, with 0 for l_j ∉ λ and 1 for l_j ∈ λ, 1 ≤ j ≤ t. Suppose a pattern P_i is assigned to l_1, l_2, and l_5 simultaneously with t = 5; then the transformed pattern-class pair is represented as (P_i, {1, 1, 0, 0, 1}). The total number of classes is the number of distinct label sets. The χ² statistic is used together with the LP to select effective features and improve the recognition performance of the multi-labeled music emotions. The results indicate that a feature selection method that evaluates the dependency of each feature by considering each label independently may not lead to better classification performance. They argued that the best classification performance can be achieved by using LP + χ², since the LP considers label correlations directly. Although the LP provides an intuitive way of transforming and takes relationships between labels into account, it suffers from class size issues (Tsoumakas et al., 2011). If there are rarely observed labels in the label set, the LP creates too many classes, causing overfitting and imbalance problems (Sun et al., 2009).

Read (2008) proposed the Pruned Problem Transformation (PPT) to improve the LP; patterns with too rarely occurring labels are simply removed from the training set by considering only label sets with a predefined minimum occurrence s. Doquire and Verleysen (2011) proposed a multi-label feature selection method using PPT to improve the classification performance of image annotation and gene function classification. First, the multi-label data set is transformed using the PPT method; next, a sequential forward selection is undertaken with MI as the search criterion. Empirical results show that this gives better classification performance than PPT + χ² when the multi-label k-nearest-neighbor classifier is applied (Zhang and Zhou, 2007). They indicate that mutual information can serve as a good score measure for evaluating the dependencies among features and labels, leading to good classification performance. However, since patterns may be discarded from the original data set, this is an irreversible transformation in which class information may be lost. As a result, the performance of learning algorithms may be limited, since the parameter s is generally unknown in practical situations.

The limitation of recent multi-label feature selection methods is that they require a problem transformation method for evaluating the dependency of given features. Since problem transformation converts the multi-label problem into a single-label problem, this process can cause subsequent problems. For example, if the transformed single label is composed of too many classes, the performance of the learning algorithm can be degraded. Moreover, if information loss occurs in the transformation process, the feature selection cannot take label relations into account. As a result, it is important to develop a feature selection method that considers multi-labels directly. Therefore, we investigate a mutual-information-based feature selection method that does not require any problem transformation. In the next section, we propose our multi-label feature selection method.

3. Multivariate mutual information for multi-label feature selection

The feature selection problem is to select a subset S of n features from a set of features F (n < d) that jointly have the largest dependency on L. To solve the feature selection problem, we should find relevant features that contain as much discriminating power about the output labels L as possible. In this section, we derive our multi-label multivariate filter criterion from the mutual information between the feature set S and the label set L. The mutual information between the selected feature subset S and the label set L can be represented as

$$I(S;L) = H(S) - H(S,L) + H(L) = H(\{f_1,\ldots,f_n\}) - H(\{f_1,\ldots,f_n,l_1,\ldots,l_t\}) + H(\{l_1,\ldots,l_t\}) \quad (1)$$

Each H(·) term of Eq. (1) is the joint entropy of an arbitrary number of variables, defined as

$$H(X) = -\sum P(X)\log P(X) \quad (2)$$

where P(X) is the probability mass function of a given set of variables X. The entropy is a measure of the self-content of a variable set.
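To make the transformation step concrete, the LP and PPT preprocessing reviewed above can be sketched in a few lines; the helper names are ours, and the sketch assumes each pattern's labels are given as a Python set:

```python
from collections import Counter

def label_powerset(Y):
    """Label Powerset (LP): map each pattern's label set to a single
    class id, one id per distinct label set."""
    classes, y_single = {}, []
    for labels in Y:
        key = frozenset(labels)
        classes.setdefault(key, len(classes))
        y_single.append(classes[key])
    return y_single, classes

def prune_rare(X, Y, s=2):
    """Pruned Problem Transformation (PPT): drop patterns whose exact
    label set occurs fewer than s times (an irreversible step, which is
    the information loss discussed above)."""
    counts = Counter(frozenset(labels) for labels in Y)
    kept = [i for i, labels in enumerate(Y) if counts[frozenset(labels)] >= s]
    return [X[i] for i in kept], [Y[i] for i in kept]
```

After either transformation, each feature can be scored against the resulting single-label target with MI or the χ² statistic, which is exactly the problem transformation + score measure pipeline.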
This notation is more useful when the input set is a power set. Suppose S = {f_1, f_2, f_3}; then U_2(S') = H(f_1, f_2) + H(f_1, f_3) + H(f_2, f_3), where S'_k = {e | e ∈ S', |e| = k}. Let Y be a possible element drawn from X'_m, where m ≤ k ≤ n. By using Eq. (4), we obtain

$$H(S) = \sum_{k=1}^{n}\sum_{m=1}^{k}(-1)^{k+m}\Bigg(\sum_{X\in S'_k}\sum_{Y\in X'_m} H(Y)\Bigg) \quad (5)$$

The proof is provided in Appendix A. To transform the high-dimensional joint entropy estimation into a series of k-dimensional joint entropy estimation problems, we decompose H(S) into a sum of pieces of multivariate mutual information using Eq. (5):

$$H(S) = \sum_{k=1}^{n}(-1)^{k}\sum_{X\in S'_k}\Bigg(\sum_{m=1}^{k}(-1)^{m}\sum_{Y\in X'_m} H(Y)\Bigg) = \sum_{X\in S'}(-1)^{|X|}\sum_{Y\in X'}(-1)^{|Y|} H(Y) = -\sum_{X\in S'}(-1)^{|X|} I(\{X\}) \quad (6)$$

Applying the same expansion to the combined set, the joint entropy H(S, L) can be written as

$$\begin{aligned} H(S,L) &= -\sum_{k=1}^{n+t}(-1)^{k}\big[V_k(S'_k\times L'_0) + V_k(S'_0\times L'_k)\big] - \sum_{k=2}^{n+t}(-1)^{k}\sum_{p=1}^{k-1} V_k(S'_{k-p}\times L'_p) \\ &= -\sum_{k=1}^{n}(-1)^{k}V_k(S') - \sum_{k=1}^{t}(-1)^{k}V_k(L') - \sum_{k=2}^{n+t}(-1)^{k}\sum_{p=1}^{k-1} V_k(S'_{k-p}\times L'_p) \end{aligned} \quad (12)$$

We can rewrite the mutual information between the two sets by combining Eqs. (8), (9), and (12):

$$\begin{aligned} I(S;L) &= H(S) + H(L) - H(S,L) \\ &= -\sum_{k=1}^{n}(-1)^{k}V_k(S') - \sum_{k=1}^{t}(-1)^{k}V_k(L') + \sum_{k=1}^{n}(-1)^{k}V_k(S') + \sum_{k=1}^{t}(-1)^{k}V_k(L') + \sum_{k=2}^{n+t}(-1)^{k}\sum_{p=1}^{k-1} V_k(S'_{k-p}\times L'_p) \\ &= \sum_{k=2}^{n+t}(-1)^{k}\sum_{p=1}^{k-1} V_k(S'_{k-p}\times L'_p) \quad (13) \end{aligned}$$
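As a sanity check of the decomposition, consider the smallest nontrivial case S = {f_1, f_2}, where the multivariate mutual information of a pair reduces to the ordinary mutual information:

$$-\sum_{X\in S'}(-1)^{|X|} I(\{X\}) = I(\{f_1\}) + I(\{f_2\}) - I(\{f_1,f_2\}) = H(f_1) + H(f_2) - \big(H(f_1) + H(f_2) - H(f_1,f_2)\big) = H(f_1,f_2),$$

which agrees with Eq. (6): the alternating sum over the power set recovers the joint entropy.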
The present study focuses on developing a computationally efficient algorithm to select a compact set of features. Thus, for computational efficiency, we consider an approximated solution of Eq. (13) by constraining the calculation of the V_k(·) functions to cardinalities of at most three. This can be obtained by replacing the limits n + t in the summations of Eq. (13) with three. By rewriting the V_k(·) functions as multivariate mutual information terms, we obtain

$$\tilde{I}(S;L) = V_2(S'_1\times L'_1) - V_3(S'_2\times L'_1) - V_3(S'_1\times L'_2) = \sum_{f_i\in S}\sum_{l_j\in L} I(\{f_i,l_j\}) - \sum_{f_i\in S}\sum_{f_j\in S}\sum_{l_k\in L} I(\{f_i,f_j,l_k\}) - \sum_{f_i\in S}\sum_{l_j\in L}\sum_{l_k\in L} I(\{f_i,l_j,l_k\}) \quad (14)$$

It is worth noting that if we want to obtain a more accurate value of the mutual information between S and L, we should calculate higher-degree relations, i.e., more than three, in Eq. (13); however, this is computationally expensive because the calculation of the multivariate mutual information becomes prohibitive. From Eq. (14), we can easily derive our feature selection algorithm in the setting of incremental selection. Suppose we have already selected a feature subset S; then the next selected feature f⁺ should maximize Ĩ({S, f⁺}; L). Thus, the feature f⁺ chosen in each step should maximize the following criterion:

$$\begin{aligned} J &= \tilde{I}(\{S,f^+\};L) - \tilde{I}(S;L) \\ &= \sum_{f_i\in\{S,f^+\}}\sum_{l_j\in L} I(\{f_i,l_j\}) - \sum_{f_i\in\{S,f^+\}}\sum_{f_j\in\{S,f^+\}}\sum_{l_k\in L} I(\{f_i,f_j,l_k\}) - \sum_{f_i\in\{S,f^+\}}\sum_{l_j\in L}\sum_{l_k\in L} I(\{f_i,l_j,l_k\}) \\ &\quad - \sum_{f_i\in S}\sum_{l_j\in L} I(\{f_i,l_j\}) + \sum_{f_i\in S}\sum_{f_j\in S}\sum_{l_k\in L} I(\{f_i,f_j,l_k\}) + \sum_{f_i\in S}\sum_{l_j\in L}\sum_{l_k\in L} I(\{f_i,l_j,l_k\}) \\ &= \sum_{l_i\in L} I(\{f^+,l_i\}) - \sum_{f_i\in S}\sum_{l_j\in L} I(\{f^+,f_i,l_j\}) - \sum_{l_i\in L}\sum_{l_j\in L} I(\{f^+,l_i,l_j\}) \quad (15) \end{aligned}$$

Algorithm 1. Proposed multi-label feature selection algorithm.
1: Input: n  (number of features to be selected)
2: Output: S  (selected feature subset)
3: Initialize S ← {∅} and F ← {f_1, ..., f_d};
4: repeat
5:   Find the feature f⁺ ∈ F maximizing Eq. (15);
6:   Set S ← S ∪ {f⁺} and F ← F \ S;
7: until |S| = n
8: Output the set S containing the selected features;

Given a set of already selected features, the algorithm chooses the next feature as the one that maximizes the criterion J under incremental selection. The proposed procedure is described in Algorithm 1. Note that this is a greedy algorithm; even though it is not guaranteed to find the global maximum of J, it provides computationally efficient performance for various applications.

4. Experimental results

In this section, we verify the effectiveness of the proposed method by comparing its performance against conventional multi-label feature selection methods.

We experiment with three data sets from different applications: bioinformatics, semantic scene analysis, and text categorization. The biological data set, Yeast, is concerned with gene function classification of the yeast Saccharomyces cerevisiae (Elisseeff and Weston, 2001). The gene functional classes are drawn from hierarchies that represent a maximum of 190 gene functions; after preprocessing, the top four levels of the hierarchies were chosen to compose the multi-labeled functions. The image data set, Scene, is concerned with the semantic indexing of still scenes (Boutell et al., 2004). Each scene may contain multiple objects, such as desert, mountains, sea, and so on; those objects can be directly used to compose the multi-label of each still scene. The text data set, Enron, is a subset of the Enron email corpus (Klimt and Yang, 2004). An email may contain words according to its objective, such as humor, admiration, friendship, and so on; thus, the words and objectives of an email can be naturally encoded into a multi-label data set. Table 1 displays standard statistics of the data sets, such as the number of patterns, number of features, number of labels, and label density (Tsoumakas and Katakis, 2007).

Table 1
Brief description of multi-label data sets.

Name   Domain   Patterns  Features  Labels  Density
Scene  Image    2407      294       6       0.179
Enron  Text     1702      1001      53      0.064
Yeast  Biology  2417      103       14      0.303

Both the Yeast and Scene data sets consist of continuous features. We discretized them using the equal-width interval scheme to improve computational efficiency (Dougherty et al., 1995); each continuous feature was binarized into a categorical feature with two bins. Note that more complex discretization schemes could be applied, or the multivariate mutual information could be calculated directly from continuous features using multi-dimensional entropy estimation techniques (Beirlant et al., 1997; Miller, 2003; Lee, 2010).

We compared our proposed method to three conventional methods: ELA + χ², PPT + χ², and PPT + MI. The classification performance of the four feature selection methods, including the proposed method, was measured using the multi-label naive Bayes (MLNB) classifier (Zhang et al., 2009). We evaluated the performance of the methods using a 30% hold-out set. Specifically, 70% of the patterns, randomly chosen from the data set, were used for the training process, and the remaining 30% were used to measure the performance of each feature selection method. These experiments were repeated 30 times, and the average value was taken to represent the classification performance. In the multi-label classification problem, performance can be assessed by several evaluation measures. We employed four conventional evaluation measures: Hamming loss, Ranking loss, Coverage, and multi-label accuracy (Boutell et al., 2004; Tsoumakas and Vlahavas, 2007). Let P = {(P_i, k_i) | 1 ≤ i ≤ p} be a given test set, where k_i ⊆ L is the correct label subset and Y_i ⊆ L is the predicted label set corresponding to P_i. The Hamming loss is defined as

$$hloss(P) = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{t}\,|k_i \,\triangle\, Y_i|$$

where △ denotes the symmetric difference between two sets. In most cases, a multi-label classifier can output a real-valued likelihood y_j between P_i and each label l_j ∈ L. The Ranking loss measures the ranking quality of those likelihoods and is defined as

$$rloss(P) = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{|k_i|\,|\bar{k}_i|}\,\Big|\{(y_1,y_2) \mid y_1 \le y_2,\ (y_1,y_2)\in k_i\times\bar{k}_i\}\Big|$$

where $\bar{k}_i$ denotes the complementary set of k_i. Moreover, the likelihoods can be ranked according to their values; for example, if y_1 > y_2 then rank(y_1) < rank(y_2). The Coverage is then defined as

$$cov(P) = \frac{1}{p}\sum_{i=1}^{p}\max_{y\in k_i} rank(y) - 1$$

In addition, the multi-label accuracy is defined as

$$mlacc(P) = \frac{1}{p}\sum_{i=1}^{p}\frac{|k_i\cap Y_i|}{|k_i\cup Y_i|}$$

The Hamming loss evaluates how many times a pattern-label pair is misclassified, and the other three measures concern the ranking quality of the different labels for each test pattern. The first three evaluation measures indicate good classification performance at low values, whereas the last, multi-label accuracy, indicates good classification performance when the classifier achieves high values. These four measures capture different aspects of multi-label classification performance.

4.2. Comparison results

Fig. 1 shows the classification performance of each feature selection method on the Scene data set. The horizontal axis represents the size of the selected feature subset according to each feature selection method, and the vertical axis indicates the classification performance under each evaluation measure. The proposed method showed superior classification performance to the other conventional methods for any size of selected feature subset. Fig. 1(a) shows that the Hamming loss of the proposed method improved with the size of the selected feature subset. In contrast, the Hamming loss of the conventional methods degraded rapidly as the size of the selected feature subset ranged from 1 to 10 and then improved slowly as the feature subset grew larger. The Hamming loss of the feature subsets selected using ELA + χ², PPT + χ², and PPT + MI was 0.3344, 0.3356, and 0.3316, respectively, with 20 features selected, whereas the Hamming loss of the feature subset selected by the proposed method was 0.1394. Thus, the proposed method showed an improvement of 0.1962 over the conventional PPT + χ². Fig. 1(b) shows that the Ranking loss was improved to a great extent by the proposed method, and this tendency was consistent with the comparison results for Coverage and multi-label accuracy, shown in Fig. 1(c) and (d). The Ranking loss of the feature subsets selected using the proposed method was 0.1344 when 20 features were
[Figure: four panels (Hamming loss, Ranking loss, Coverage, multi-label accuracy) on the Scene data set, each plotted against the number of input features (5–50) for ELA+CHI+MLNB, PPT+CHI+MLNB, PPT+MI+MLNB, and Proposed+MLNB.]
Fig. 1. Classification performance on the Scene data set according to feature subsets selected using the proposed method and three conventional feature selection methods: PPT + χ², PPT + MI, and ELA + χ².
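The equal-width, two-bin discretization applied to the continuous Scene and Yeast features (Section 4) amounts to thresholding each feature at the midpoint of its observed range; a minimal sketch (the helper name is ours):

```python
import numpy as np

def equal_width_binarize(X):
    """Two-bin equal-width discretization: each feature is thresholded at
    the midpoint of its observed range (midpoint goes to the upper bin)."""
    X = np.asarray(X, dtype=float)
    mid = (X.min(axis=0) + X.max(axis=0)) / 2.0
    return (X >= mid).astype(int)
```

More bins would partition each range into equal-width intervals instead; two bins are the special case used in the experiments.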
selected, whereas the Ranking loss of the feature subsets selected using ELA + χ², PPT + χ², and PPT + MI was 0.2369, 0.2370, and 0.2344, respectively. Thus, we can conclude that the proposed method selected a more effective feature subset than the other conventional feature selection methods. The best classification performance evaluated by Hamming loss, Ranking loss, Coverage, and multi-label accuracy was in every case achieved by the proposed method, with scores of 0.1394, 0.1280, 1.7296, and 0.5348 at different sizes of selected feature subsets.

Fig. 2 shows the classification performance of each feature selection method on the Enron data set. The proposed method shows better classification performance than the other conventional methods in terms of Hamming loss. Fig. 2(a) shows that our proposed method demonstrated superior performance in the Hamming loss experiment: for any size of selected feature subset, the proposed method outperformed the other conventional methods. The Hamming loss of the feature subsets selected using ELA + χ², PPT + χ², and PPT + MI was 0.0976, 0.0930, and 0.0949, respectively, for 20 selected features, whereas the Hamming loss of the feature subset selected by the proposed method was 0.0631. Fig. 2(b) shows that the four feature selection methods started with similar Ranking loss values until 15 features were selected. However, as the selected feature subset grew, our proposed method showed better performance than the three other methods. The Ranking loss of the feature subsets selected by the proposed method was 0.1251 when 20 features were selected, whereas ELA + χ², PPT + χ², and PPT + MI achieved Ranking losses of 0.1453, 0.1457, and 0.1400, respectively, for the same size of feature subset. A similar tendency can be seen in Fig. 2(c) and (d). Hence, our proposed method has shown its superiority over the other three conventional methods in the classification of the Enron data set.
[Figure: four panels (Hamming loss, Ranking loss, Coverage, multi-label accuracy) on the Enron data set, each plotted against the number of input features (5–50) for ELA+CHI+MLNB, PPT+CHI+MLNB, PPT+MI+MLNB, and Proposed+MLNB.]
Fig. 2. Classification performance on the Enron data set according to feature subsets selected using the proposed method and three conventional feature selection methods: PPT + χ², PPT + MI, and ELA + χ².
[Figure: four panels (Hamming loss, Ranking loss, Coverage, multi-label accuracy) on the Yeast data set, each plotted against the number of input features (5–50) for ELA+CHI+MLNB, PPT+CHI+MLNB, PPT+MI+MLNB, and Proposed+MLNB.]
Fig. 3. Classification performance on the Yeast data set according to feature subsets selected using the proposed method and three conventional feature selection methods: PPT + χ², PPT + MI, and ELA + χ².
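The four measures reported in Figs. 1–3 follow the definitions given in Section 4; a direct sketch (label sets as Python sets, classifier likelihoods as per-label score lists; the helper names are ours):

```python
import numpy as np

def hamming_loss(true_sets, pred_sets, t):
    """(1/p) sum over patterns of |k_i symmetric-difference Y_i| / t."""
    return float(np.mean([len(k ^ y) / t for k, y in zip(true_sets, pred_sets)]))

def ranking_loss(true_sets, scores):
    """Fraction of (relevant, irrelevant) label pairs that are mis-ordered,
    i.e. the relevant label is scored no higher than the irrelevant one."""
    losses = []
    for k, s in zip(true_sets, scores):
        rel = [s[l] for l in k]
        irr = [s[l] for l in range(len(s)) if l not in k]
        if rel and irr:
            losses.append(sum(a <= b for a in rel for b in irr)
                          / (len(rel) * len(irr)))
    return float(np.mean(losses))

def coverage(true_sets, scores):
    """Average depth in the score ranking needed to cover all true labels,
    minus one (rank 1 = highest-scored label)."""
    covs = []
    for k, s in zip(true_sets, scores):
        order = np.argsort(-np.asarray(s))          # best label first
        rank = {int(l): r + 1 for r, l in enumerate(order)}
        covs.append(max(rank[l] for l in k) - 1)
    return float(np.mean(covs))

def multilabel_accuracy(true_sets, pred_sets):
    """(1/p) sum of Jaccard overlaps |k_i & Y_i| / |k_i | Y_i|."""
    return float(np.mean([len(k & y) / len(k | y)
                          for k, y in zip(true_sets, pred_sets)]))
```

As in the text, the first three measures are better when lower and multi-label accuracy is better when higher.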
Fig. 3 shows the classification performance of each feature selection method on the Yeast data set. The comparison results for the Hamming loss are shown in Fig. 3(a). The proposed method always showed better Hamming loss than the other three conventional methods. The Hamming loss of the feature subsets selected using ELA + χ², PPT + χ², and PPT + MI was 0.2494, 0.2390, and 0.2397, respectively, for 20 selected features, whereas the Hamming loss of the feature subset selected by the proposed method was 0.2273. Thus, the proposed method shows better classification performance than the other conventional methods in terms of Hamming loss. The Ranking loss of the proposed method was better when the size of the selected feature subset ranged from 15 to 30. The Ranking loss of the feature subsets selected by the proposed method was 0.2047 when 20 features were selected, whereas ELA + χ², PPT + χ², and PPT + MI achieved 0.2171, 0.2148, and 0.2168, respectively, with the same size of feature subset. The comparison results for both Coverage and multi-label accuracy were similar to the Ranking loss results, as shown in Fig. 3(c) and (d). Although the proposed method showed better performance than the other methods, the gain in classification performance was not large, because the dependencies of the individual features in the Yeast data set are similar to one another.

5. Conclusions

In this paper, we presented a multivariate mutual information-based feature selection method for multi-label classification. Our proposed method does not rely on any problem transformation method to select a relevant feature subset. To efficiently evaluate the dependency of input features in multivariate situations, the proposed method calculates three-dimensional interactions among features and labels instead of calculating prohibitive high-dimensional density estimations. Our comprehensive experiments demonstrate that the
classification performance can be significantly improved by the proposed method. Comparison results on three real-world data sets from different domains show the advantage of the proposed method over the three conventional methods, which are based on three problem transformation methods and two score measures, in terms of four multi-label performance measures: Hamming loss, Ranking loss, Coverage, and multi-label accuracy. Thus, we showed that the proposed method can find very effective feature subsets for the multi-label classification problem.

Future work should include a study of the influence of the approximation accuracy; because the proposed method only considers three-dimensional interactions among features and labels, it may lose important information when there are higher-order label dependencies. However, addressing this leads to two practical difficulties: high-dimensional density estimation with a limited number of training patterns, and the expensive computational cost of high-order multivariate mutual information. As future work, we would like to study this issue further.

Appendix A

Eq. (5) indicates that the entropy of a variable set is equal to a sum of entropies over the power set of S with Möbius inversion. This is easy to picture with Pascal's triangle, in which each element is the sum of the two elements in the row above it. Take the values in any row except the first; if the values at even positions are counted as positive and the values at odd positions as negative, then the sum of these values is zero. The proof of Eq. (5) uses this property: a sum of entropies over the power set of a set S with Möbius inversion can be represented by a linear function of Eq. (4). Let Σ_{X∈S'_k} Σ_{Y∈X'_m} H(Y) = a_{k,m} U_m(S'); then Eq. (5) can be represented as

$$\sum_{k=1}^{n}\sum_{m=1}^{k}(-1)^{k+m}\Bigg(\sum_{X\in S'_k}\sum_{Y\in X'_m}H(Y)\Bigg) = \sum_{k=1}^{n}\sum_{m=1}^{k}(-1)^{k+m}\,a_{k,m}\,U_m(S'), \quad m \le k \quad (17)$$

The coefficient a_{k,m} is determined by the generator. First, X is chosen from S'_k, and the number of possible X ∈ S'_k is $\binom{n}{k}$. In addition, Y is generated from an element of X'_m with cardinality m, which contributes a factor $\binom{k}{m}$, where 1 ≤ m ≤ k. Finally, the term U_m(S') is composed of $\binom{n}{m}$ entropy terms. Thus, we can formalize the coefficient a_{k,m} as

$$a_{k,m} = \frac{\binom{n}{k}\binom{k}{m}}{\binom{n}{m}} = \frac{(n-m)!}{(n-k)!\,(k-m)!} \quad (18)$$

Further, we can simplify a_{k,m} to $\binom{n-m}{k-m}$, where m ≤ k ≤ n. Thus, by combining Eqs. (17) and (18) using Pascal's triangle, we obtain

$$\sum_{k=1}^{n}\sum_{m=1}^{k}(-1)^{k+m}\,a_{k,m}\,U_m(S') = \sum_{k=1}^{n}\sum_{m=1}^{k}(-1)^{k+m}\binom{n-m}{k-m}U_m(S') \quad (19)$$
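The closing simplification of a_{k,m} is a binomial identity that is easy to verify exhaustively for small n; a quick numeric check:

```python
from math import comb

# Verify a_{k,m} = C(n,k)*C(k,m)/C(n,m) = (n-m)!/((n-k)!(k-m)!) = C(n-m, k-m)
# for all 1 <= m <= k <= n.
for n in range(1, 9):
    for k in range(1, n + 1):
        for m in range(1, k + 1):
            assert comb(n, k) * comb(k, m) % comb(n, m) == 0
            a = comb(n, k) * comb(k, m) // comb(n, m)
            assert a == comb(n - m, k - m)
```

The divisibility assertion confirms that a_{k,m} is always an integer, as the factorial form in Eq. (18) requires.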
Appendix B

This relation can be easily confirmed using properties of the power set. Let S be a set of n variables, and let S' denote the power set of S. Suppose we add a set of t variables L. Then {S, L}' can be represented as

$$\{S,L\}' = \{S'\times L'\} = \{S'_0\times L'_0,\ S'_0\times L'_1,\ \ldots,\ S'_n\times L'_t\} \quad (23)$$

where × denotes the Cartesian product of two sets. Let us illustrate the situation of including the new variables in Eq. (8). Since a set of variables is newly included, we should consider the additional relations among them. This can be written as

$$H(S,L) = -\sum_{k=0}^{n}\sum_{m=0}^{t}(-1)^{k+m}\,V_{k+m}(S'_k\times L'_m) \quad (24)$$

Eq. (24) can be simplified using the V_{k+m}(·) function. For example, V_{0+2}(S'_0×L'_2) + V_{1+1}(S'_1×L'_1) + V_{2+0}(S'_2×L'_0) can be represented as V_2(S'×L'). Thus, we can rewrite Eq. (24) as follows:

$$H(S,L) = -\sum_{k=0}^{n+t}(-1)^{k}\,V_k(S'\times L') \quad (25)$$

Since V_0(·) = 0 and {S'×L'} = {S, L}', Eq. (25) can be rewritten as

$$H(S,L) = -\sum_{k=1}^{n+t}(-1)^{k}\,V_k(\{S,L\}') \quad (26)$$

Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (2012-0001772).

References

Beirlant, J., Dudewicz, E., Györfi, L., Van der Meulen, E., 1997. Nonparametric entropy estimation: An overview. Internat. J. Math. Statist. Sci. 6, 17–40.
Boutell, M., Luo, J., Shen, X., Brown, C., 2004. Learning multi-label scene classification. Pattern Recognition 37, 1757–1771.
Chen, W., Yan, J., Zhang, B., Chen, Z., Yang, Q., 2007. Document transformation for multi-label feature selection in text categorization. In: Proc. Seventh IEEE Internat. Conf. on Data Mining (ICDM'07), pp. 451–456.
Diplaris, S., Tsoumakas, G., Mitkas, P., Vlahavas, I., 2005. Protein classification with multiple algorithms. Adv. Inf. 3746, 448–456.
Doquire, G., Verleysen, M., 2011. Feature selection for multi-label classification problems. Adv. Comput. Intell. 6691, 9–16.
Dougherty, J., Kohavi, R., Sahami, M., 1995. Supervised and unsupervised discretization of continuous features. In: Internat. Worksh. Conf. on Machine Learning. Morgan Kaufmann Publishers, Inc., pp. 194–202.
Elisseeff, A., Weston, J., 2001. A kernel method for multi-labelled classification. Adv. Neural Inf. Process. Systems 14, 681–687.
Gu, Q., Li, Z., Han, J., 2011. Correlated multi-label feature selection. In: Proc. 20th ACM Internat. Conf. on Information and Knowledge Management. ACM, pp. 1087–1096.
Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. J. Machine Learn. Res. 3, 1157–1182.
Klimt, B., Yang, Y., 2004. The Enron corpus: A new dataset for email classification research. Lect. Notes Comput. Sci. 3201, 217–226.
Lee, I., 2010. Sample-spacings-based density and entropy estimators for spherically invariant multidimensional data. Neural Comput. 22, 2208–2227.
Lewis, D., Yang, Y., Rose, T., Li, F., 2004. RCV1: A new benchmark collection for text categorization research. J. Machine Learn. Res. 5, 361–397.
McGill, W., 1954. Multivariate information transmission. IRE Trans. Inf. Theory 4, 93–111.
Miller, E., 2003. A new class of entropy estimators for multi-dimensional densities. In: Proc. 2003 IEEE Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP'03). IEEE, pp. 297–300.
Read, J., 2008. A pruned problem transformation method for multi-label classification. In: Proc. 2008 New Zealand Comput. Sci. Res. Stud. Conf. (NZCSRS'08), pp. 143–150.
Saeys, Y., Inza, I., Larrañaga, P., 2007. A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–2517.
Schapire, R., Singer, Y., 2000. BoosTexter: A boosting-based system for text categorization. Machine Learn. 39, 135–168.
Sebastiani, F., 2002. Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47.
Sun, Y., Wong, A., Kamel, M., 2009. Classification of imbalanced data: A review. Internat. J. Pattern Recognition Artif. Intell. 23, 687.
Trohidis, K., Tsoumakas, G., Kalliris, G., Vlahavas, I., 2008. Multilabel classification of music into emotions. In: Proc. Ninth Internat. Conf. on Music Inform. Retrieval (ISMIR'08), Philadelphia, PA, USA.
Tsoumakas, G., Katakis, I., 2007. Multi-label classification: An overview. Internat. J. Data Warehouse Min. 3, 1–13.
Tsoumakas, G., Katakis, I., Vlahavas, I., 2011. Random k-labelsets for multi-label classification. IEEE Trans. Knowl. Data Eng. 23, 1079–1089.
Tsoumakas, G., Vlahavas, I., 2007. Random k-labelsets: An ensemble method for multilabel classification. Machine Learn. (ECML'07) 4701, 406–417.
Watanabe, S., 1969. Knowing and Guessing: A Quantitative Study of Inference and Information. Wiley, New York.
Yang, Y., Pedersen, J., 1997. A comparative study on feature selection in text categorization. In: Proc. 14th Internat. Conf. on Machine Learning, pp. 412–420.
Zhang, M., Peña, J., Robles, V., 2009. Feature selection for multi-label naive Bayes classification. Inf. Sci. 179, 3218–3229.
Zhang, M., Zhou, Z., 2007. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition 40, 2038–2048.