
Pattern Recognition Letters 34 (2013) 349–357


Feature selection for multi-label classification using multivariate mutual information

Jaesung Lee, Dae-Won Kim *

School of Computer Science and Engineering, Chung-Ang University, 221, Heukseok-Dong, Dongjak-Gu, Seoul 156-756, Republic of Korea

* Corresponding author. Tel.: +82 2 820 5304. E-mail address: dwkim@cau.ac.kr (D.-W. Kim).

Article info

Article history: Received 3 April 2012. Available online 2 November 2012. Communicated by S. Sarkar.

Keywords: Multi-label feature selection; Multivariate feature selection; Multivariate mutual information; Label dependency

Abstract

Recently, classification tasks that naturally emerge in multi-label domains, such as text categorization, automatic scene annotation, and gene function prediction, have attracted great interest. As in traditional single-label classification, feature selection plays an important role in multi-label classification. However, recent feature selection methods require preprocessing steps that transform the label set into a single label, resulting in subsequent additional problems. In this paper, we propose a feature selection method for multi-label classification that naturally derives from the mutual information between the selected features and the label set. The proposed method was applied to several multi-label classification problems and compared with conventional methods. The experimental results demonstrate that the proposed method improves the classification performance to a great extent and has proved to be a useful method for selecting features in multi-label classification problems.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Multi-label classification is a challenging problem that emerges in several modern applications such as text categorization, gene function classification, and semantic annotation of images (Schapire and Singer, 2000; Sebastiani, 2002; Lewis et al., 2004; Diplaris et al., 2005; Boutell et al., 2004). As in the traditional classification problem, the performance of multi-label classification is strongly influenced by the quality of the input features. In theory, a pattern may lose its distinguishability owing to irrelevant or redundant features, since such features can decrease the similarity between pairs of patterns in the same class (Watanabe, 1969). They can also confuse the learning algorithm and lead to poor classification performance (Guyon and Elisseeff, 2003; Saeys et al., 2007).

Consequently, most recent research concerned with multi-label classification naturally employs feature selection techniques (Yang and Pedersen, 1997; Chen et al., 2007; Doquire and Verleysen, 2011; Trohidis et al., 2008). Feature selection is the task of selecting relevant features directly, so that the internal meaning of the given features is preserved as it is. This is an important constraint in some applications; for example, the task of gene function classification considers the classification accuracy as well as the biological analysis of the selected features (Diplaris et al., 2005). In the present study, we focus on the feature selection approach to improve the performance of multi-label classification while preserving the inherent meaning of the given features.

To select a set of relevant features from a given data set, some multi-label feature selection algorithms optimize a set of parameters during the feature selection process to tune the kernel function of a multi-label classifier (Gu et al., 2011). However, this approach frequently requires exhaustive calculations to find an appropriate hyperspace using pairwise comparisons of patterns. This process must be repeated in each iterative feature selection step, so it is impractical from the viewpoint of computational cost. Another way of treating multi-label learning is to convert the multi-label problem into a traditional single-label multi-class problem and then evaluate each feature in terms of its dependency on the transformed single label (Chen et al., 2007; Trohidis et al., 2008). This is the simplest approach and provides a connection between single-label learning research and novel multi-label learning. However, it causes subsequent problems: because multiple labels are transformed into a single label, the newly created label inherently contains too many classes, leading to difficulty in learning (Read, 2008).

In this paper, we propose a mutual-information-based multi-label feature selection criterion. The characteristic of our proposed method is that it does not involve any type of transformation method; it selects an effective feature subset by maximizing the dependency between the selected features and the labels. To the best of our knowledge, this is the first feature filter criterion that takes label interactions into account when evaluating the dependency of given features, without resorting to problem transformation. This paper is organized as follows: Section 2 gives a detailed description of conventional feature selection methods that require a transformation method. In Section 3, we propose our multi-label feature filter criterion; to achieve this, we decompose the calculation of high-dimensional entropy into a cumulative sum of multivariate mutual information. The performance of the proposed method is investigated with several evaluation measures on various multi-label data sets in Section 4. The discussion and conclusions are presented in Section 5.


2. Related work

Before reviewing conventional feature selection methods, we introduce some basic notation for multi-label learning. Let W ⊆ R^d denote an input space constructed from d features, and let patterns drawn from W be assigned to a certain label subset λ ⊆ L, where L = {l_1, ..., l_t} is a finite set of labels with |L| = t. Multi-label classification is thus the task of assigning unseen patterns to multiple labels. To solve the multi-label classification problem, an algorithm should take many labels into account concurrently. Popular multi-label learning algorithms first transform the label sets into a single label (a process called problem transformation) and then solve the resultant problem (Tsoumakas and Katakis, 2007; Yang and Pedersen, 1997; Doquire and Verleysen, 2011). Similarly, feature selection in the problem transformation approach proceeds as follows: (1) transform the original multi-label data set into a single-label data set; (2) assess each feature independently using a score evaluation method such as mutual information (MI) or the χ2 statistic; (3) select the predefined top n features as input features for the multi-label classifier. We represent this process as problem transformation + score measure, a notation that will be used subsequently.

Chen et al. (2007) proposed an Entropy-based Label Assignment (ELA) that assigns weights to a multi-label pattern for its different labels based on the label entropy. ELA copies each pattern in accordance with the number of its labels, and the inverse of the number of its labels is then assigned as the weight of each copy. Thus, each original pattern-labels pair (P_i, λ) is transformed to a set of patterns T_i = {(P_i1, l_1), ..., (P_i|λ|, l_|λ|)} with weights 1/|λ|, where 1 ≤ i ≤ |W| and l_j ∈ λ. Since patterns with too many labels are blurred out of the training phase owing to the low weights assigned to them, the authors argued that the learning algorithm can avoid the overfitting problem originating from these patterns. Text categorization data sets were transformed by ELA, and three feature selection methods were then exhaustively applied to each transformed data set; two of them employed information gain and the χ2 statistic as their score measures, and the third used an optimal orthogonal centroid feature selection method. Their empirical experiments indicate that any problem transformation method yielding a loss of information about the dependency among labels may lead to poor classification performance, even though the classification performance was improved by the feature selection methods.

The Label Powerset (LP) has been applied to music information retrieval, specifically for recognizing six emotions that are simultaneously evoked by a music clip (Trohidis et al., 2008). It transforms a multi-label to a single label by assigning each pattern's label set to a single class, so each pattern-labels pair (P_i, λ) is transformed to (P_i, c_i), where c_i ∈ {0,1}^t with 0 for l_j ∉ λ and 1 for l_j ∈ λ, 1 ≤ j ≤ t. Suppose a pattern P_i is assigned to l_1, l_2, and l_5 simultaneously; the transformed pattern-class pair is then represented as (P_i, {1,1,0,0,1}) where t = 5. The total number of classes is the total number of distinct label sets. The χ2 statistic is used to select effective features with the LP to improve the recognition performance of the multi-labeled music emotions. The results indicate that a feature selection method that evaluates the dependency of each feature by considering each label independently may not lead to better classification performance. The authors argued that the best classification performance can be achieved by using LP + χ2, since the LP considers label correlations directly. Although the LP provides an intuitive way of transforming and takes relationships between labels into account, it suffers from class size issues (Tsoumakas et al., 2011). If there are rarely observed labels in the label set, the LP creates too many classes, causing overfitting and imbalance problems (Sun et al., 2009).

Read (2008) proposed the Pruned Problem Transformation (PPT) to improve the LP; patterns with too rarely occurring labels are simply removed from the training set by considering label sets with a predefined minimum occurrence s. Doquire and Verleysen (2011) proposed a multi-label feature selection method using PPT to improve the classification performance of image annotation and gene function classification. First, the multi-label data set is transformed using the PPT method; next, a sequential forward selection is undertaken with MI as the search criterion. Empirical results show that this gives better classification performance than PPT + χ2 when the multi-label k-nearest-neighbors classifier is applied (Zhang and Zhou, 2007). These results indicate that mutual information can be used as a good score measure for evaluating the dependencies among features and labels, leading to good classification performance. However, since patterns may be discarded from the original data set, this is an irreversible transformation in which class information can be lost. As a result, the performance of learning algorithms may be limited, because the parameter s is generally unknown in practical situations.

The limitation of recent multi-label feature selection methods is that they require a problem transformation method for evaluating the dependency of given features. Since problem transformation converts the multi-label problem into a single-label problem, this process can cause subsequent problems. For example, if the transformed single label is composed of too many classes, the performance of the learning algorithm may be degraded. Moreover, if information loss occurs in the transformation process, the feature selection cannot take label relations into account. As a result, it is important to develop a feature selection method that considers multi-labels directly. Therefore, we investigate a mutual-information-based feature selection method that does not require any problem transformation. In the next section, we propose our multi-label feature selection method.

3. Multivariate mutual information for multi-label feature selection

The feature selection problem is to select a subset S composed of n selected features, from a set of features F (n < d), that jointly has the largest dependency on L. To solve the feature selection problem, we should find relevant features that contain as much discriminating power about the output labels L as possible. In this section, we derive our multi-label multivariate filter criterion from the equation of mutual information between the feature set S and the label set L. The mutual information between the selected feature subset S and the label set L can be represented as follows:

I(S;L) = H(S) - H(S,L) + H(L)
       = H({f_1, ..., f_n}) - H({f_1, ..., f_n, l_1, ..., l_t}) + H({l_1, ..., l_t})    (1)

Each H(·) term of Eq. (1) represents the joint entropy of an arbitrary number of variables, defined as

H(X) = -\sum P(X) \log P(X)    (2)

where P(X) is the probability mass function of a given set of variables X. The entropy is a measure of the self-content of a variable set; in contrast, mutual information focuses on the information shared between variables. Note that if the given variables are continuous, we can use either the differential entropy to obtain the entropy of continuous variables, or a preprocessing (discretization) method to transform the continuous variables into discrete (categorical) counterparts. For the sake of simplicity, we present the key notion of the proposed method using categorical variables, in which each variable has a finite number of categories or discretized values.
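As a concrete illustration of Eqs. (1) and (2), the following minimal Python sketch (our own illustration, not code from the paper; the helper names and the toy data are assumptions) estimates joint entropies of categorical variables by counting joint value combinations and combines them into I(S;L).

import numpy as np
from collections import Counter

def joint_entropy(columns):
    """Empirical joint entropy (in nats) of one or more categorical columns.

    `columns` is a list of equal-length 1-D arrays; each row across the
    columns is treated as one joint outcome, per Eq. (2).
    """
    counts = np.array(list(Counter(zip(*columns)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def set_mutual_information(feature_cols, label_cols):
    """I(S;L) = H(S) - H(S,L) + H(L), as in Eq. (1)."""
    return (joint_entropy(feature_cols)
            - joint_entropy(feature_cols + label_cols)
            + joint_entropy(label_cols))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f1 = rng.integers(0, 2, 1000)    # binary (discretized) toy features
    f2 = rng.integers(0, 2, 1000)
    l1 = f1 ^ f2                     # label that depends on both features
    l2 = rng.integers(0, 2, 1000)    # irrelevant label
    print(set_mutual_information([f1, f2], [l1, l2]))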

The entropy term requires a high-dimensional probability estimation of the given variables. However, this is computationally too expensive and also too hard to estimate accurately owing to the limited amount of training data. Therefore, we try to approximate the high-dimensional joint entropy term by a series of practically computable terms. We first rewrite the high-dimensional joint entropy term as a sum of a series of multivariate mutual information terms (a process we refer to as Decompose), and then we approximate it with a view to computational efficiency. The multivariate mutual information of a given variable set T can be defined by information theory (McGill, 1954):

I({T}) = -\sum_{X \in T'} (-1)^{|X|} H(X)    (3)

Note that \sum_{X \in T'} represents a sum over all elements X drawn from T', and T' denotes the power set of T. Suppose T = {f_1, f_2}; then T' = {∅, {f_1}, {f_2}, {f_1, f_2}}. While mutual information measures the dependence between a pair of variables, multivariate mutual information can account for dependencies among multiple variables.

3.1. Decomposing high-dimensional entropy: H(S) and H(L)

In this section, we decompose the high-dimensional joint entropy terms in Eq. (1): H(S) and H(L). Let S be a set of n features, and let X be one possible element drawn from S_k = {e | e ∈ S, |e| = k}. Then, the sum of entropies over all elements whose cardinality is k can be defined as:

U_k(S) = \sum_{X \in S_k} H(X)    (4)

This notation is more useful when the input set is a power set. Suppose S = {f_1, f_2, f_3}; then U_2(S') = H(f_1,f_2) + H(f_1,f_3) + H(f_2,f_3), where S'_k = {e | e ∈ S', |e| = k}. Let Y be a possible element drawn from X'_m, where m ≤ k ≤ n. By using Eq. (4), we obtain:

H(S) = \sum_{k=1}^{n} \sum_{m=1}^{k} (-1)^{k+m} \left( \sum_{X \in S'_k} \sum_{Y \in X'_m} H(Y) \right)    (5)

The proof is provided in Appendix A. To transform the high-dimensional joint entropy estimation into a series of k-dimensional joint entropy estimation problems, we decompose H(S) into a sum of pieces of multivariate mutual information using Eq. (5):

H(S) = \sum_{k=1}^{n} (-1)^{k} \left( \sum_{X \in S'_k} \sum_{m=1}^{k} \sum_{Y \in X'_m} (-1)^{m} H(Y) \right)
     = -\sum_{X \in S'} (-1)^{|X|} \left( -\sum_{Y \in X'} (-1)^{|Y|} H(Y) \right) = -\sum_{X \in S'} (-1)^{|X|} I({X})    (6)

Similar to Eq. (4), we now define the sum of multivariate mutual information with cardinality k as follows:

V_k(S) = \sum_{X \in S_k} I({X})    (7)

Then, we can rewrite Eq. (6) as follows:

H(S) = -\sum_{k=1}^{n} (-1)^{k} V_k(S')    (8)

As a result, H(S) is represented by the right-hand side of Eq. (8), and the joint entropy of the labels, H(L), where |L| = t, can be represented as:

H(L) = -\sum_{k=1}^{t} (-1)^{k} V_k(L')    (9)

3.2. Decomposing the joint entropy of two sets: H(S,L)

We can decompose the joint entropy of the two sets using Eq. (5):

H(S,L) = -\sum_{k=1}^{n+t} (-1)^{k} V_k({S,L}')    (10)

A detailed derivation is provided in Appendix B. Further, we can divide the power set {S,L}' on the right-hand side of Eq. (10) into three parts; the first part has variable sets from S' × L'_0, the second part has variable sets from S'_0 × L', and the third part is composed of the remaining subsets, where × denotes the Cartesian product of two sets. For example, V_3({S,L}') can be represented as a sum over these three parts; the first part is V_3(S'_3 × L'_0), the second part is V_3(S'_0 × L'_3), and the third part is V_3(S'_2 × L'_1) + V_3(S'_1 × L'_2). Thus, V_k({S,L}') can be rewritten as:

V_k({S,L}') = V_k(S'_k × L'_0) + V_k(S'_0 × L'_k) + \sum_{p=1}^{k-1} V_k(S'_{k-p} × L'_p)    (11)

Eq. (11) represents that any element of cardinality k from {S,L}' can be divided into elements from S'_k = S'_k × L'_0, L'_k = S'_0 × L'_k, and S'_{k-p} × L'_p. Because the third part represents multivariate mutual information among variables chosen from a combination of S and L, there are no such terms with k = 1. Hence, we rewrite Eq. (10) using Eq. (11) as follows:

H(S,L) = -\sum_{k=1}^{n+t} (-1)^{k} \left( V_k(S'_k × L'_0) + V_k(S'_0 × L'_k) \right)
         - \sum_{k=2}^{n+t} (-1)^{k} \left( \sum_{p=1}^{k-1} V_k(S'_{k-p} × L'_p) \right)
       = -\sum_{k=1}^{n} (-1)^{k} V_k(S') - \sum_{k=1}^{t} (-1)^{k} V_k(L')
         - \sum_{k=2}^{n+t} \sum_{p=1}^{k-1} (-1)^{k} V_k(S'_{k-p} × L'_p)    (12)

We can rewrite the mutual information between the two sets by combining Eqs. (8), (9), and (12):

I(S;L) = H(S) + H(L) - H(S,L)
       = -\sum_{k=1}^{n} (-1)^{k} V_k(S') - \sum_{k=1}^{t} (-1)^{k} V_k(L')
         + \sum_{k=1}^{n} (-1)^{k} V_k(S') + \sum_{k=1}^{t} (-1)^{k} V_k(L')
         + \sum_{k=2}^{n+t} \sum_{p=1}^{k-1} (-1)^{k} V_k(S'_{k-p} × L'_p)
       = \sum_{k=2}^{n+t} \sum_{p=1}^{k-1} (-1)^{k} V_k(S'_{k-p} × L'_p)    (13)

Finally, we obtain a k-dimensional representation of the mutual information between the two sets, as shown in Eq. (13).
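The building block of this decomposition is the multivariate mutual information of Eq. (3). The sketch below (again our own illustration; it repeats the plug-in entropy estimator from the earlier snippet) computes I({T}) by inclusion-exclusion over the non-empty subsets of T. For two variables the value reduces to the ordinary pairwise mutual information, which the last lines check numerically.

import numpy as np
from collections import Counter
from itertools import combinations

def joint_entropy(columns):
    """Empirical joint entropy (nats) of categorical columns, per Eq. (2)."""
    counts = np.array(list(Counter(zip(*columns)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def multivariate_mi(columns):
    """I({T}) = -sum over non-empty X in T' of (-1)^{|X|} H(X), per Eq. (3)."""
    total, n = 0.0, len(columns)
    for size in range(1, n + 1):
        for subset in combinations(columns, size):
            total -= (-1) ** size * joint_entropy(list(subset))
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.integers(0, 2, 2000)
    y = rng.integers(0, 2, 2000)
    z = x ^ y    # pairwise independent of x and y, but jointly dependent
    # Two-variable case equals H(x) + H(y) - H(x, y):
    print(multivariate_mi([x, y]),
          joint_entropy([x]) + joint_entropy([y]) - joint_entropy([x, y]))
    # Three-variable interaction term of the kind used by Eq. (14):
    print(multivariate_mi([x, y, z]))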

3.3. Feature selection algorithm

The present study focuses on developing a computationally efficient algorithm to select a compact set of features. Thus, for computational efficiency, we consider an approximated solution of Eq. (13) by constraining the calculation of the V_k(·) functions to cardinalities of at most three. This can be obtained by replacing the limit n + t in the summations of Eq. (13) with three. By rewriting the V_k(·) functions as multivariate mutual information terms, we obtain:

Ĩ(S;L) = V_2(S'_1 × L'_1) - V_3(S'_2 × L'_1) - V_3(S'_1 × L'_2)
       = \sum_{f_i \in S} \sum_{l_j \in L} I({f_i, l_j}) - \sum_{f_i \in S} \sum_{f_j \in S} \sum_{l_k \in L} I({f_i, f_j, l_k})
         - \sum_{f_i \in S} \sum_{l_j \in L} \sum_{l_k \in L} I({f_i, l_j, l_k})    (14)

It is worth noting that if we want a more accurate value of the mutual information between S and L, we should calculate higher degrees of relations, i.e., more than three, in Eq. (13); however, this is computationally expensive because the calculation of multivariate mutual information becomes prohibitive. From Eq. (14), we can easily derive our feature selection algorithm in the circumstance of incremental selection. Suppose we have already selected a feature subset S; then the feature f^+ to be selected should maximize Ĩ({S, f^+}; L) in the incremental selection. Thus, the feature f^+ in each step should maximize the following equation:

J = Ĩ({S, f^+}; L) - Ĩ(S; L)
  = \sum_{f_i \in {S,f^+}} \sum_{l_j \in L} I({f_i, l_j}) - \sum_{f_i \in {S,f^+}} \sum_{f_j \in {S,f^+}} \sum_{l_k \in L} I({f_i, f_j, l_k})
    - \sum_{f_i \in {S,f^+}} \sum_{l_j \in L} \sum_{l_k \in L} I({f_i, l_j, l_k})
    - \sum_{f_i \in S} \sum_{l_j \in L} I({f_i, l_j}) + \sum_{f_i \in S} \sum_{f_j \in S} \sum_{l_k \in L} I({f_i, f_j, l_k})
    + \sum_{f_i \in S} \sum_{l_j \in L} \sum_{l_k \in L} I({f_i, l_j, l_k})
  = \sum_{l_i \in L} I({f^+, l_i}) - \sum_{f_i \in S} \sum_{l_j \in L} I({f^+, f_i, l_j}) - \sum_{l_i \in L} \sum_{l_j \in L} I({f^+, l_i, l_j})    (15)

Algorithm 1. Proposed multi-label feature selection algorithm

1: Input: n;                        ▷ number of features to be selected
2: Output: S;                       ▷ selected feature subset
3: Initialize S ← ∅ and F ← {f_1, ..., f_d};
4: repeat
5:   Find the feature f^+ ∈ F maximizing Eq. (15);
6:   Set S ← {S ∪ f^+}, and F ← F \ S;
7: until |S| = n
8: Output the set S containing the selected features;

Given a set of already selected features, the algorithm chooses the next feature as the one that maximizes the criterion J under incremental selection. The proposed algorithm is described in Algorithm 1. Note that this is a greedy algorithm; even though it is not guaranteed to find the global maximum value of J, it provides computationally efficient performance for various applications.
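A minimal sketch of Algorithm 1 under the criterion J of Eq. (15), assuming discretized feature columns and binary label columns as NumPy arrays; the helper names (H, I2, I3, select_features) and the toy data are our own, and the paper itself does not prescribe an implementation. Pairwise terms I({f, l}) and the three-way terms of Eq. (15) are computed by inclusion-exclusion over joint entropies, and label pairs are enumerated as unordered 2-subsets of L, matching the set-based V_3 terms of Eq. (13).

import numpy as np
from collections import Counter
from itertools import combinations

def H(*cols):
    """Empirical joint entropy (nats) of categorical columns, per Eq. (2)."""
    counts = np.array(list(Counter(zip(*cols)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def I2(a, b):
    """Pairwise mutual information I({a, b})."""
    return H(a) + H(b) - H(a, b)

def I3(a, b, c):
    """Three-way multivariate mutual information I({a, b, c}), per Eq. (3)."""
    return (H(a) + H(b) + H(c)
            - H(a, b) - H(a, c) - H(b, c)
            + H(a, b, c))

def select_features(features, labels, n):
    """Greedy multi-label feature selection maximizing J of Eq. (15).

    `features`: list of d categorical columns; `labels`: list of t label columns.
    Returns the indices of the n selected features in selection order.
    """
    selected, remaining = [], list(range(len(features)))
    label_pairs = list(combinations(range(len(labels)), 2))
    while len(selected) < n and remaining:
        def score(f):
            j = sum(I2(features[f], l) for l in labels)         # relevance term
            j -= sum(I3(features[f], features[s], l)            # feature-interaction term
                     for s in selected for l in labels)
            j -= sum(I3(features[f], labels[a], labels[b])      # label-interaction term
                     for a, b in label_pairs)
            return j
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = [rng.integers(0, 2, 500) for _ in range(10)]   # ten binarized toy features
    Y = [X[3] ^ X[7], X[3]]                            # two labels driven by features 3 and 7
    print(select_features(X, Y, 3))

One practical remark on this sketch: the label-interaction term depends only on the candidate feature f^+ and not on the already selected subset S, so in a tuned implementation it can be cached once per feature instead of being recomputed at every greedy step.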

4. Experimental results

In this section, we verify the effectiveness of the proposed method by comparing its performance against conventional multi-label feature selection methods.

4.1. Data sets and evaluation

We experiment with three data sets from different applications: bioinformatics, semantic scene analysis, and text categorization. The biological data set, Yeast, is concerned with gene function classification of the yeast Saccharomyces cerevisiae (Elisseeff and Weston, 2001). The gene functional classes are drawn from hierarchies that represent a maximum of 190 gene functions; after preprocessing, the top four levels of the hierarchies were chosen to compose the multi-labeled functions. The image data set, Scene, is concerned with the semantic indexing of still scenes (Boutell et al., 2004). Each scene possibly contains multiple objects such as desert, mountains, sea, and so on; thus, those objects can be directly used to compose the multi-label of each still scene. The text data set, Enron, is a subset of the Enron email corpus (Klimt and Yang, 2004). An email may contain words according to its objective, such as humor, admiration, friendship, and so on; thus, the words and objectives of an email can be naturally encoded into a multi-label data set. Table 1 displays standard statistics of the data sets, such as the number of patterns, number of features, number of labels, and label density (Tsoumakas and Katakis, 2007).

Table 1
Brief description of multi-label data sets.

Name     Domain    Patterns    Features    Labels    Density
Scene    Image     2407        294         6         0.179
Enron    Text      1702        1001        53        0.064
Yeast    Biology   2417        103         14        0.303

Both the Yeast and Scene data sets consist of continuous features. We discretized the Yeast and Scene data sets using the equal-width interval scheme to improve the computational efficiency (Dougherty et al., 1995); each continuous feature was then binarized into a categorical feature with two bins. Note that more complex discretization schemes can be applied, or the multivariate mutual information can be calculated directly from continuous features using multi-dimensional entropy estimation techniques (Beirlant et al., 1997; Miller, 2003; Lee, 2010).

We compared our proposed method to three conventional methods: ELA + χ2, PPT + χ2, and PPT + MI. The classification performance of the four feature selection methods, including the proposed method, was measured using the multi-label naive Bayes (MLNB) classifier (Zhang et al., 2009). We evaluated the performance of the methods using a 30% hold-out set. Specifically, 70% of the patterns, randomly chosen from the data set, were used for the training process, and the remaining 30% were used for measuring the performance of each feature selection method. These experiments were repeated 30 times, and the average value was taken to represent the classification performance. In the multi-label classification problem, performance can be assessed by several evaluation measures. We employed four conventional evaluation measures: Hamming loss, Ranking loss, Coverage, and multi-label accuracy (Boutell et al., 2004; Tsoumakas and Vlahavas, 2007). Let P = {(P_i, λ_i) | 1 ≤ i ≤ p} be a given test set, where λ_i ⊆ L is the correct label subset and Y_i ⊆ L is the predicted label set corresponding to P_i. The Hamming loss is defined as hloss(P) = (1/p) \sum_{i=1}^{p} (1/t) |λ_i Δ Y_i|, where Δ denotes the symmetric difference between two sets. In most cases, a multi-label classifier can output a real-valued likelihood y_j between P_i and each label l_j ∈ L. The Ranking loss measures the ranking quality of those likelihoods, defined as rloss(P) = (1/p) \sum_{i=1}^{p} (1/(|λ_i| |λ̄_i|)) |{(y_1, y_2) | y_1 ≤ y_2, (y_1, y_2) ∈ λ_i × λ̄_i}|, where λ̄_i denotes the complementary set of λ_i. Moreover, those likelihoods can be ranked according to their values; for example, if y_1 > y_2 then rank(y_1) < rank(y_2). The Coverage can then be defined as cov(P) = (1/p) \sum_{i=1}^{p} max_{y \in λ_i} rank(y) - 1. In addition, the multi-label accuracy is defined as mlacc(P) = (1/p) \sum_{i=1}^{p} |λ_i ∩ Y_i| / |λ_i ∪ Y_i|. The Hamming loss evaluates how many times a pattern-label pair is misclassified, and the other three measures concern the ranking quality of the different labels for each test pattern. The first three evaluation measures indicate good classification performance when they take low values, whereas the last measure, multi-label accuracy, indicates good classification performance when the classifier achieves high values. These four evaluation measures thus capture different aspects of multi-label classification performance.
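The four measures above translate directly into code; the sketch below (our own, with illustrative names) takes, for each test pattern, the true label set λ_i, a predicted set Y_i, and a real-valued likelihood per label, and returns the Hamming loss, Ranking loss, Coverage, and multi-label accuracy as defined in this section.

import numpy as np

def evaluate(true_sets, pred_sets, likelihoods, t):
    """Hamming loss, Ranking loss, Coverage, and multi-label accuracy.

    true_sets, pred_sets: lists of Python sets of label indices (lambda_i, Y_i);
    likelihoods: list of length-t arrays with the real-valued score y_j per label.
    """
    p = len(true_sets)
    hloss = rloss = cov = mlacc = 0.0
    for lam, Y, y in zip(true_sets, pred_sets, likelihoods):
        comp = set(range(t)) - lam                      # complement of lambda_i
        hloss += len(lam ^ Y) / t                       # symmetric difference
        if lam and comp:
            bad = sum(1 for a in lam for b in comp if y[a] <= y[b])
            rloss += bad / (len(lam) * len(comp))
        # rank 1 = largest likelihood; Coverage uses the worst-ranked true label
        order = np.argsort(-np.asarray(y))
        rank = {label: r + 1 for r, label in enumerate(order)}
        if lam:
            cov += max(rank[a] for a in lam) - 1
        mlacc += len(lam & Y) / len(lam | Y) if (lam | Y) else 1.0
    return hloss / p, rloss / p, cov / p, mlacc / p

if __name__ == "__main__":
    scores = [np.array([0.9, 0.2, 0.7]), np.array([0.1, 0.8, 0.3])]
    truth = [{0, 2}, {1}]
    preds = [{0}, {1, 2}]
    print(evaluate(truth, preds, scores, t=3))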

4.2. Comparison results

Fig. 1 shows the classification performance of each feature selection method for the Scene data set. The horizontal axis represents the size of the selected feature subset according to each feature selection method, and the vertical axis indicates the classification performance under a given evaluation measure. The proposed method showed superior classification performance to the other conventional methods for any size of selected feature subset. Fig. 1(a) shows that the Hamming loss of the proposed method improved with the size of the selected feature subset. However, the Hamming loss of the other conventional methods rapidly degraded as the size of the selected feature subset ranged from 1 to 10, and then the performance slowly improved as the size of the feature subset grew larger. The Hamming loss of the feature subsets selected using ELA + χ2, PPT + χ2, and PPT + MI was 0.3344, 0.3356, and 0.3316, respectively, with 20 features selected, whereas the Hamming loss of the feature subset selected by the proposed method was 0.1394. Thus, the proposed method showed an improvement of 0.1962 over the conventional PPT + χ2. Fig. 1(b) shows that the Ranking loss was improved to a great extent by using the proposed method, and this tendency was consistent with the comparison results for the Coverage and multi-label accuracy, as shown in Fig. 1(c) and (d). The Ranking loss of the feature subsets selected using the proposed method was 0.1344 when 20 features were selected, whereas the Ranking loss of the feature subsets selected using ELA + χ2, PPT + χ2, and PPT + MI was 0.2369, 0.2370, and 0.2344, respectively. Thus, we can conclude that the proposed method selected a more effective feature subset than the other conventional feature selection methods. The best classification performance evaluated by the Hamming loss, Ranking loss, Coverage, and multi-label accuracy was in every case achieved by the proposed method, with scores of 0.1394, 0.1280, 1.7296, and 0.5348 at different sizes of selected feature subsets.

Fig. 1. Classification performance of the Scene data set according to feature subsets using the proposed method and three conventional feature selection methods: PPT + χ2, PPT + MI, and ELA + χ2. (Four panels plot the Hamming loss, Ranking loss, Coverage, and multi-label accuracy of ELA+CHI+MLNB, PPT+CHI+MLNB, PPT+MI+MLNB, and Proposed+MLNB against the number of input features.)

Fig. 2 shows the classification performance of each feature selection method for the Enron data set. The proposed method shows better classification performance compared to the other conventional methods in terms of the Hamming loss. Fig. 2(a) shows that our proposed method demonstrated superior performance compared to the other conventional methods in the Hamming loss experiment; for any size of the selected feature subsets, the proposed method showed better performance than the other conventional methods. The Hamming loss of the feature subsets selected using ELA + χ2, PPT + χ2, and PPT + MI was 0.0976, 0.0930, and 0.0949, respectively, for 20 selected features, whereas the Hamming loss of the feature subset selected by the proposed method was 0.0631. Fig. 2(b) shows that the four feature selection methods started with a similar Ranking loss value until 15 features were selected. However, with the growing size of the selected feature subset, our proposed method showed better performance than the three other methods. The Ranking loss of the feature subsets obtained by the proposed method was 0.1251 when 20 features were selected, whereas ELA + χ2, PPT + χ2, and PPT + MI achieved a Ranking loss of 0.1453, 0.1457, and 0.1400, respectively, for the same size of feature subsets. We can see a similar tendency in Fig. 2(c) and (d). Hence, our proposed method has shown its superiority compared to the other three conventional methods in the classification of the Enron data set.

Fig. 2. Classification performance of the Enron data set according to feature subsets using the proposed method and three conventional feature selection methods: PPT + χ2, PPT + MI, and ELA + χ2. (Four panels plot the Hamming loss, Ranking loss, Coverage, and multi-label accuracy of ELA+CHI+MLNB, PPT+CHI+MLNB, PPT+MI+MLNB, and Proposed+MLNB against the number of input features.)

Fig. 3. Classification performance of the Yeast data set according to feature subsets using the proposed method and three conventional feature selection methods: PPT + χ2, PPT + MI, and ELA + χ2. (Four panels plot the Hamming loss, Ranking loss, Coverage, and multi-label accuracy of ELA+CHI+MLNB, PPT+CHI+MLNB, PPT+MI+MLNB, and Proposed+MLNB against the number of input features.)

Fig. 3 shows the classification performance of each feature selection method for the Yeast data set. The comparison results for the Hamming loss are shown in Fig. 3(a). The proposed method always showed a better Hamming loss than the other three conventional methods. The Hamming loss of the feature subsets selected using ELA + χ2, PPT + χ2, and PPT + MI was 0.2494, 0.2390, and 0.2397, respectively, for 20 selected features, whereas the Hamming loss of the feature subset selected by the proposed method was 0.2273. Thus, the proposed method shows better classification performance than the other conventional methods in terms of the Hamming loss. The Ranking loss of the proposed method was also better when the size of the selected feature subset ranged from 15 to 30. The Ranking loss of the feature subsets obtained by the proposed method was 0.2047 when 20 features were selected, whereas ELA + χ2, PPT + χ2, and PPT + MI achieved 0.2171, 0.2148, and 0.2168, respectively, with the same size of feature subsets. The comparison results for both the Coverage and multi-label accuracy were similar to the Ranking loss results, as shown in Fig. 3(c) and (d). Although the proposed method showed better performance than the other methods, the gain in classification performance was not large, because the dependencies of the individual features in the Yeast data set are similar to one another.

5. Conclusions

In this paper, we presented a multivariate mutual information-based feature selection method for multi-label classification. Our proposed method does not rely on any problem transformation method to select a relevant feature subset. To efficiently evaluate the dependency of the input features in multivariate situations, the proposed method calculates three-dimensional interactions among features and labels instead of calculating prohibitive high-dimensional density estimations. Our comprehensive experiments demonstrate that the classification performance can be significantly improved by the proposed method. Comparison results on three real-world data sets emerging from different domains show the advantage of the proposed method compared with the three conventional methods, which are based on two problem transformation methods and two score measures, in terms of four multi-label performance measures: the Hamming loss, Ranking loss, Coverage, and multi-label accuracy. Thus, we showed that the proposed method can find very effective feature subsets for the multi-label classification problem.

Future work should include the study of the influence of the approximation accuracy; because the proposed method only considers three-dimensional interactions among features and labels, it may lose important information associated with higher-order label dependency. However, considering such interactions leads to two practical difficulties: high-dimensional density estimation with a limited number of training patterns, and the expensive computational cost of high-order multivariate mutual information. As future work, we would like to study this issue further.

Appendix A

Eq. (5) indicates that the entropy of a variable set is equal to the sum of the entropies of the power set of a set S with Möbius inversion. This is easy to see if we consider Pascal's Triangle, in which each element is determined by the sum of the two elements in the row above it. Take the values in any row except the first; if the values located in even positions are taken as positive and the values located in odd positions as negative, then the sum of these values is zero. The proof of Eq. (5) uses this property: the sum of the entropies of the power set of S with Möbius inversion is equal to the entropy of S itself, since the entropies of the subsets of S cancel each other by the binomial theorem, except the entropy of the variable set S, the value located in the first row.

Let X be a possible element drawn from S'_k, and let Y be a possible element drawn from X'_m, where m ≤ k ≤ n. Since our goal is to show that the equality is satisfied, we swap the left-hand side and the right-hand side of Eq. (5) for simplicity. Thus the equation is rewritten as:

\sum_{k=1}^{n} \sum_{m=1}^{k} (-1)^{k+m} \underbrace{\sum_{X \in S'_k} \sum_{Y \in X'_m} H(Y)}_{\text{Part 1}} = H(S)    (16)

Proof. Since Part 1 is the sum over all possible elements from X'_m, with the largest cardinality constrained by k, this term can be represented as a sum of a series of several joint entropy terms. We illustrate the behavior of Part 1 for varying k and m in Table 2. For example, the series can be represented by Eq. (4) as 2U_1(S') = 2(H(f_1) + H(f_2) + H(f_3)) when k = 2 and m = 1 in Table 2. Thus, we can see that Part 1 can be represented by a linear function of Eq. (4). Let \sum_{X \in S'_k} \sum_{Y \in X'_m} H(Y) = a_{k,m} U_m(S'); then, Eq. (5) can be represented as:

\sum_{k=1}^{n} \sum_{m=1}^{k} (-1)^{k+m} \left( \sum_{X \in S'_k} \sum_{Y \in X'_m} H(Y) \right) = \sum_{k=1}^{n} \sum_{m=1}^{k} (-1)^{k+m} a_{k,m} U_m(S')  where m ≤ k    (17)

Table 2
An example of Part 1 in Eq. (5) when S = {f_1, f_2, f_3}. X represents possible elements drawn from S'_k, whereas Y represents the possible elements drawn from X'_m. As seen in the table, the entropy sum according to k and m can be easily represented by Eq. (4).

           X ∈ S'_1                        X ∈ S'_2                                                     X ∈ S'_3
           {f_1}    {f_2}    {f_3}         {f_1,f_2}         {f_1,f_3}         {f_2,f_3}                {f_1,f_2,f_3}
Y ∈ X'_1   H(f_1)   H(f_2)   H(f_3)        H(f_1), H(f_2)    H(f_1), H(f_3)    H(f_2), H(f_3)           H(f_1), H(f_2), H(f_3)
Y ∈ X'_2                                   H(f_1,f_2)        H(f_1,f_3)        H(f_2,f_3)               H(f_1,f_2), H(f_1,f_3), H(f_2,f_3)
Y ∈ X'_3                                                                                                H(f_1,f_2,f_3)

The coefficient a_{k,m} is determined by the generator. First, X is chosen from S'_k, and the number of possible X ∈ S'_k can be represented as \binom{n}{k}. In addition, Y is generated from the elements of X'_m with cardinality m, and thus we simply represent this count as \binom{k}{m}, where 1 ≤ m ≤ k. Finally, the term U_m(S') is composed of \binom{n}{m} of the entropy terms. Thus, we can formalize the coefficient a_{k,m} as:

a_{k,m} = \binom{n}{k} \binom{k}{m} / \binom{n}{m} = \frac{(n-m)!}{(n-k)!(k-m)!}    (18)

Further, we can simplify a_{k,m} as \binom{n-m}{k-m}, where m ≤ k ≤ n. Thus, by combining Eqs. (17) and (18) using Pascal's Triangle, we obtain:

\sum_{k=1}^{n} \sum_{m=1}^{k} (-1)^{k+m} a_{k,m} U_m(S') = \sum_{k=1}^{n} \sum_{m=1}^{k} (-1)^{k+m} \binom{n-m}{k-m} U_m(S')    (19)

Eq. (19) represents the sum of entropies in the kth row of Pascal's Triangle. We transform the limits of the summations by rewriting the row-wise summations as column-wise summations in Pascal's Triangle as follows:

\sum_{k=1}^{n} \sum_{m=1}^{k} (-1)^{k+m} \binom{n-m}{k-m} U_m(S') = \sum_{m=1}^{n} \sum_{k=m}^{n} (-1)^{k+m} \binom{n-m}{k-m} U_m(S')    (20)

where m ≤ k. Replacing k with k + m:

\sum_{m=1}^{n} \sum_{k=m}^{n} (-1)^{k+m} \binom{n-m}{k-m} U_m(S')
  = \sum_{m=1}^{n} \sum_{k+m=m}^{n} (-1)^{(k+m)+m} \binom{n-m}{(k+m)-m} U_m(S')
  = \sum_{m=1}^{n} \sum_{k=0}^{n-m} (-1)^{k} \binom{n-m}{k} U_m(S')    (21)

Because \sum_{j=0}^{n} (-1)^{j} \binom{n}{j} = 0 for n > 0 by the binomial theorem, we separate \sum_{m=1}^{n} into \sum_{m=1}^{n-1} and the m = n term; for m = n, the inner sum \sum_{k=0}^{n-m} in Eq. (21) reduces to \sum_{k=0}^{0}, and we obtain:

\sum_{m=1}^{n} \sum_{k=0}^{n-m} (-1)^{k} \binom{n-m}{k} U_m(S')
  = \sum_{m=1}^{n-1} \sum_{k=0}^{n-m} (-1)^{k} \binom{n-m}{k} U_m(S') + U_n(S')
  = \sum_{m=1}^{n-1} (0) + H(S) = H(S)    (22)

Eq. (22) indicates that the entropy of the variable set S can be written as a combination of the entropies computed from its subsets. □
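The identity established by this proof can also be checked numerically. The short sketch below (our own; it reuses the plug-in estimators from the earlier snippets) draws three random categorical features and confirms that the empirical H(S) coincides with the right-hand side of Eq. (8), since the identity holds exactly for plug-in estimates computed from the same sample.

import numpy as np
from collections import Counter
from itertools import combinations

def H(*cols):
    """Empirical joint entropy (nats) of categorical columns, per Eq. (2)."""
    counts = np.array(list(Counter(zip(*cols)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def I_set(cols):
    """Multivariate mutual information of Eq. (3) by inclusion-exclusion."""
    return -sum((-1) ** r * H(*sub)
                for r in range(1, len(cols) + 1)
                for sub in combinations(cols, r))

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    S = [rng.integers(0, 3, 2000) for _ in range(3)]   # three categorical toy features
    # Right-hand side of Eq. (8): -sum_k (-1)^k V_k(S'), with V_k summing I({X})
    # over all k-subsets of S.
    rhs = -sum((-1) ** k * sum(I_set(list(sub)) for sub in combinations(S, k))
               for k in range(1, len(S) + 1))
    print(H(*S), rhs)   # the two values agree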

Appendix B

This relation can be easily confirmed using properties of the power set. Let S be a set of n variables, and let S' denote the power set of S. Suppose we add a set of t variables L. Then {S, L}' can be represented as:

{S, L}' = {S' × L'} = {S'_0 × L'_0, S'_0 × L'_1, ..., S'_n × L'_t}    (23)

where × denotes the Cartesian product between two sets. Let us illustrate the situation of including the new variables in Eq. (8). Since a set of variables is newly included, we should consider the additional relations among them. This can be written as:

H(S,L) = -\sum_{k=0}^{n} \sum_{m=0}^{t} (-1)^{k+m} V_{k+m}(S'_k × L'_m)    (24)

Eq. (24) can be further simplified using the V_{k+m}(·) function. For example, V_{0+2}(S'_0 × L'_2) + V_{1+1}(S'_1 × L'_1) + V_{2+0}(S'_2 × L'_0) can be represented as V_2(S' × L'). Thus we can rewrite Eq. (24) as follows:

H(S,L) = -\sum_{k=0}^{n+t} (-1)^{k} V_k(S' × L')    (25)

Since V_0(·) = 0 and {S' × L'} = {S, L}', Eq. (25) can be rewritten as:

H(S,L) = -\sum_{k=1}^{n+t} (-1)^{k} V_k({S, L}')    (26)

Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2012-0001772).

References

Beirlant, J., Dudewicz, E., Györfi, L., Van der Meulen, E., 1997. Nonparametric entropy estimation: An overview. Internat. J. Math. Statist. Sci. 6, 17-40.
Boutell, M., Luo, J., Shen, X., Brown, C., 2004. Learning multi-label scene classification. Pattern Recognition 37, 1757-1771.
Chen, W., Yan, J., Zhang, B., Chen, Z., Yang, Q., 2007. Document transformation for multi-label feature selection in text categorization. In: Proc. Seventh IEEE Internat. Conf. on Data Mining (ICDM'07), pp. 451-456.
Diplaris, S., Tsoumakas, G., Mitkas, P., Vlahavas, I., 2005. Protein classification with multiple algorithms. Adv. Inf. 3746, 448-456.
Doquire, G., Verleysen, M., 2011. Feature selection for multi-label classification problems. Adv. Comput. Intell. 6691, 9-16.
Dougherty, J., Kohavi, R., Sahami, M., 1995. Supervised and unsupervised discretization of continuous features. In: Internat. Worksh. Conf. on Machine Learning. Morgan Kaufmann Publishers, Inc., pp. 194-202.
Elisseeff, A., Weston, J., 2001. A kernel method for multi-labelled classification. Adv. Neural Inf. Process. Systems 14, 681-687.
Gu, Q., Li, Z., Han, J., 2011. Correlated multi-label feature selection. In: Proc. 20th ACM Internat. Conf. on Information and Knowledge Management. ACM, pp. 1087-1096.
Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. J. Machine Learn. Res. 3, 1157-1182.
Klimt, B., Yang, Y., 2004. The Enron corpus: A new dataset for email classification research. Lect. Notes Comput. Sci. 3201, 217-226.
Lee, I., 2010. Sample-spacings-based density and entropy estimators for spherically invariant multidimensional data. Neural Comput. 22, 2208-2227.
Lewis, D., Yang, Y., Rose, T., Li, F., 2004. RCV1: A new benchmark collection for text categorization research. J. Machine Learn. Res. 5, 361-397.
McGill, W., 1954. Multivariate information transmission. IRE Trans. Inf. Theory 4, 93-111.
Miller, E., 2003. A new class of entropy estimators for multi-dimensional densities. In: Proc. 2003 IEEE Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP'03). IEEE, pp. 297-300.
Read, J., 2008. A pruned problem transformation method for multi-label classification. In: Proc. 2008 New Zealand Comput. Sci. Res. Stud. Conf. (NZCSRS'08), pp. 143-150.
Saeys, Y., Inza, I., Larrañaga, P., 2007. A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507-2517.
Schapire, R., Singer, Y., 2000. BoosTexter: A boosting-based system for text categorization. Machine Learn. 39, 135-168.
Sebastiani, F., 2002. Machine learning in automated text categorization. ACM Comput. Surv. 34, 1-47.
Sun, Y., Wong, A., Kamel, M., 2009. Classification of imbalanced data: A review. Internat. J. Pattern Recognition Artif. Intell. 23, 687.
Trohidis, K., Tsoumakas, G., Kalliris, G., Vlahavas, I., 2008. Multilabel classification of music into emotions. In: Proc. Ninth Internat. Conf. on Music Information Retrieval (ISMIR'08), Philadelphia, PA, USA.
Tsoumakas, G., Katakis, I., 2007. Multi-label classification: An overview. Internat. J. Data Warehouse Min. 3, 1-13.
Tsoumakas, G., Katakis, I., Vlahavas, I., 2011. Random k-labelsets for multi-label classification. IEEE Trans. Knowl. Data Eng. 23, 1079-1089.
Tsoumakas, G., Vlahavas, I., 2007. Random k-labelsets: An ensemble method for multilabel classification. Machine Learn. (ECML'07) 4701, 406-417.
Watanabe, S., 1969. Knowing and Guessing: A Quantitative Study of Inference and Information. Wiley, New York.
Yang, Y., Pedersen, J., 1997. A comparative study on feature selection in text categorization. In: Proc. 14th Internat. Conf. on Machine Learning, pp. 412-420.
Zhang, M., Peña, J., Robles, V., 2009. Feature selection for multi-label naive Bayes classification. Inf. Sci. 179, 3218-3229.
Zhang, M., Zhou, Z., 2007. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition 40, 2038-2048.
