



A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification
Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee, Member, IEEE

Abstract—Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. In this paper, we propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on a similarity test. Words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation. When all the words have been fed in, a desired number of clusters are formed automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster. By this algorithm, the derived membership functions closely match and properly describe the real distribution of the training data. Besides, the user need not specify the number of extracted features in advance, and trial-and-error for determining the appropriate number of extracted features can then be avoided. Experimental results show that our method can run faster and obtain better extracted features than other methods.

Index Terms—Fuzzy similarity, feature clustering, feature extraction, feature reduction, text classification.

This work was supported by the National Science Council under the grants NSC-95-2221-E-110-055-MY2 and NSC-96-2221-E-110-009.
Corresponding author: Shie-Jue Lee, leesj@mail.ee.nsysu.edu.tw

1 INTRODUCTION

In text classification, the dimensionality of the feature vector is usually huge. For example, 20 Newsgroups [1] and Reuters21578 top-10 [2], which are two real-world datasets, both have more than 15,000 features. Such high dimensionality can be a severe obstacle for classification algorithms [3], [4]. To alleviate this difficulty, feature reduction approaches are applied before document classification tasks are performed [5]. Two major approaches, feature selection [6], [7], [8], [9], [10] and feature extraction [11], [12], [13], have been proposed for feature reduction. In general, feature extraction approaches are more effective than feature selection techniques, but are more computationally expensive [11], [12], [14]. Therefore, developing scalable and efficient feature extraction algorithms is highly demanded for dealing with high-dimensional document datasets.

Classical feature extraction methods aim to convert the representation of the original high-dimensional dataset into a lower-dimensional dataset by a projecting process through algebraic transformations. For example, Principal Component Analysis [15], Linear Discriminant Analysis [16], Maximum Margin Criterion [12], and the Orthogonal Centroid algorithm [17] perform the projection by linear transformations, while Locally Linear Embedding [18], ISOMAP [19], and Laplacian Eigenmaps [20] do feature extraction by nonlinear transformations. In practice, linear algorithms are in wider use due to their efficiency. Several scalable online linear feature extraction algorithms [14], [21], [22], [23] have been proposed to improve the computational complexity. However, the complexity of these approaches is still high. Feature clustering [24], [25], [26], [27], [28], [29] is one of the effective techniques for feature reduction in text classification. The idea of feature clustering is to group the original features into clusters with a high degree of pairwise semantic relatedness. Each cluster is treated as a single new feature, and thus feature dimensionality can be drastically reduced.

The first feature extraction method based on feature clustering was proposed by Baker and McCallum [24]; it was derived from the “distributional clustering” idea of Pereira et al. [30]. Al-Mubaid and Umair [31] used distributional clustering to generate an efficient representation of documents and applied a learning logic approach for training text classifiers. The Agglomerative Information Bottleneck approach was proposed by Tishby et al. [25], [29]. The divisive information-theoretic feature clustering algorithm proposed by Dhillon et al. [27] is an information-theoretic feature clustering approach that is more effective than other feature clustering methods. In these feature clustering methods, each new feature is generated by combining a subset of the original words. However, difficulties are associated with these methods. A word is exactly assigned to one subset, i.e., hard-clustering, based on the similarity magnitudes between the word and the existing subsets, even if the differences among these magnitudes are small. Also, the mean and the variance of a cluster are not considered when similarity with respect to the cluster is computed. Furthermore, these methods require the number of new features to be specified in advance by the user.

We propose a fuzzy similarity-based self-constructing feature clustering algorithm, an incremental feature clustering approach that reduces the number of features for the text classification task. The words in the feature vector of a document set are represented as distributions and processed one after another. Words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation.




If a word is not similar to any existing cluster, a new cluster is created for this word. Similarity between a word and a cluster is defined by considering both the mean and the variance of the cluster. When all the words have been fed in, a desired number of clusters are formed automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster. Three ways of weighting, hard, soft, and mixed, are introduced. By this algorithm, the derived membership functions closely match and properly describe the real distribution of the training data. Besides, the user need not specify the number of extracted features in advance, and trial-and-error for determining the appropriate number of extracted features can then be avoided. Experiments on real-world datasets show that our method can run faster and obtain better extracted features than other methods.

The remainder of this paper is organized as follows. Section 2 gives a brief background about feature reduction. Section 3 presents the proposed fuzzy similarity-based self-constructing feature clustering algorithm. An example illustrating how the algorithm works is given in Section 4. Experimental results are presented in Section 5. Finally, we conclude this work in Section 6.

2 BACKGROUND AND RELATED WORK

To process documents, the bag-of-words model [32], [33] is commonly used. Let D = {d1, d2, ..., dn} be a document set of n documents, where d1, d2, ..., dn are individual documents and each document belongs to one of the classes in the set {c1, c2, ..., cp}. If a document belongs to two or more classes, then two or more copies of the document with different classes are included in D. Let the word set W = {w1, w2, ..., wm} be the feature vector of the document set. Each document di, 1 ≤ i ≤ n, is represented as di = <di1, di2, ..., dim>, where each dij denotes the number of occurrences of wj in the ith document. The feature reduction task is to find a new word set W' = {w1', w2', ..., wk'}, k << m, such that W and W' work equally well for all the desired properties with D. After feature reduction, each document di is converted into a new representation di' = <di1', di2', ..., dik'>, and the converted document set is D' = {d1', d2', ..., dn'}. If k is much smaller than m, the computation cost of subsequent operations on D' can be drastically reduced.

2.1 Feature Reduction

In general, there are two ways of doing feature reduction, feature selection and feature extraction. By feature selection approaches, a new feature set W' = {w1', w2', ..., wk'} is obtained, which is a subset of the original feature set W. Then W' is used as input for classification tasks. Information Gain (IG) is frequently employed in the feature selection approach [10]. It measures the reduced uncertainty by an information-theoretic measure and gives each word a weight. The weight of a word wj is calculated as follows:

IG(wj) = − Σ_{l=1}^{p} P(cl) log P(cl)
         + P(wj) Σ_{l=1}^{p} P(cl|wj) log P(cl|wj)
         + P(w̄j) Σ_{l=1}^{p} P(cl|w̄j) log P(cl|w̄j)    (1)

where P(cl) denotes the prior probability for class cl, P(wj) denotes the prior probability for feature wj, P(w̄j) is identical to 1 − P(wj), and P(cl|wj) and P(cl|w̄j) denote the probability for class cl with the presence and absence, respectively, of wj. The words of top k weights in W are selected as the features in W'.

In feature extraction approaches, extracted features are obtained by a projecting process through algebraic transformations. An incremental orthogonal centroid (IOC) algorithm was proposed in [14]. Let a corpus of documents be represented as an m × n matrix X ∈ R^{m×n}, where m is the number of features in the feature set and n is the number of documents in the document set. IOC tries to find an optimal transformation matrix F* ∈ R^{m×k}, where k is the desired number of extracted features, according to the following criterion:

F* = arg max trace(F^T Sb F)    (2)

where F ∈ R^{m×k} and F^T F = I, and

Sb = Σ_{q=1}^{p} P(cq)(Mq − Mall)(Mq − Mall)^T    (3)

with P(cq) being the prior probability for a pattern belonging to class cq, Mq being the mean vector of class cq, and Mall being the mean vector of all patterns.

2.2 Feature Clustering

Feature clustering is an efficient approach for feature reduction [25], [29], which groups all features into some clusters where the features in a cluster are similar to each other. The feature clustering methods proposed in [24], [25], [27], [29] are “hard” clustering methods, where each word of the original features belongs to exactly one word cluster. Therefore each word contributes to the synthesis of only one new feature. Each new feature is obtained by summing up the words belonging to one cluster. Let D be the matrix consisting of all the original documents with m features and D' be the matrix consisting of the converted documents with the new k features. The new feature set W' = {w1', w2', ..., wk'} corresponds to a partition {W1, W2, ..., Wk} of the original feature set W, i.e., Wt ∩ Wq = ∅, where 1 ≤ q, t ≤ k and t ≠ q. Note that a cluster corresponds to an element in the partition. Then, the tth feature value of the converted document di' is calculated as follows:

dit' = Σ_{wj ∈ Wt} dij    (4)

which is a linear sum of the feature values in Wt.

The divisive information-theoretic feature clustering (DC) algorithm, proposed by Dhillon et al. [27], calculates the distributions of words over classes, P(C|wj), 1 ≤ j ≤ m, where C = {c1, c2, ..., cp}, and uses Kullback-Leibler divergence to measure the dissimilarity between two distributions. The distribution of a cluster Wt is calculated as follows:

P(C|Wt) = Σ_{wj ∈ Wt} [ P(wj) / Σ_{wj ∈ Wt} P(wj) ] P(C|wj).    (5)
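As a concrete illustration of Eqs.(4) and (5), the following Python/NumPy sketch converts documents under a given hard word partition and computes a cluster's class distribution. It is only an illustration of the background formulas above, not the implementation of [24] or [27]; the function and variable names (transform_documents, cluster_distribution, doc_term) are ours.

import numpy as np

def transform_documents(doc_term, clusters):
    # Eq.(4): the t-th new feature value of a document is the sum of the
    # counts of the original words assigned to cluster W_t.
    new_features = np.zeros((doc_term.shape[0], len(clusters)))
    for t, words in enumerate(clusters):        # clusters: list of word-index lists
        new_features[:, t] = doc_term[:, words].sum(axis=1)
    return new_features

def cluster_distribution(p_c_given_w, p_w, words):
    # Eq.(5): P(C|W_t) is the P(w_j)-weighted average of P(C|w_j)
    # over the words j belonging to cluster W_t.
    weights = p_w[words] / p_w[words].sum()
    return weights @ p_c_given_w[words, :]

# toy usage: 4 documents, 5 words, hard partition {w0, w1} and {w2, w3, w4}
doc_term = np.array([[1, 0, 2, 0, 1],
                     [0, 3, 0, 1, 0],
                     [2, 1, 0, 0, 1],
                     [0, 0, 1, 2, 0]])
print(transform_documents(doc_term, [[0, 1], [2, 3, 4]]))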

The goal of DC is to minimize the following objective function:

Σ_{t=1}^{k} Σ_{wj ∈ Wt} P(wj) KL( P(C|wj), P(C|Wt) )    (6)

which takes the sum over all the k clusters, where k is specified by the user in advance.

3 OUR METHOD

There are some issues pertinent to most of the existing feature clustering methods. Firstly, the parameter k, indicating the desired number of extracted features, has to be specified in advance. This puts a burden on the user, since trial-and-error has to be done until the appropriate number of extracted features is found. Secondly, when calculating similarities, the variance of the underlying cluster is not considered. Intuitively, the distribution of the data in a cluster is an important factor in the calculation of similarity. Thirdly, all words in a cluster have the same degree of contribution to the resulting extracted feature. Sometimes it may be better if more similar words are allowed to have bigger degrees of contribution. Our feature clustering algorithm is proposed to deal with these issues.

Suppose we are given a document set D of n documents d1, d2, ..., dn, together with the feature vector W of m words w1, w2, ..., wm and p classes c1, c2, ..., cp, as specified in Section 2. We construct one word pattern for each word in W. For word wi, its word pattern xi is defined, similarly as in [27], by

xi = <xi1, xi2, ..., xip>
   = <P(c1|wi), P(c2|wi), ..., P(cp|wi)>    (7)

where

P(cj|wi) = ( Σ_{q=1}^{n} dqi × δqj ) / ( Σ_{q=1}^{n} dqi )    (8)

for 1 ≤ j ≤ p. Note that dqi indicates the number of occurrences of wi in document dq, as described in Section 2. Also, δqj is defined as

δqj = 1, if document dq belongs to class cj; 0, otherwise.    (9)

Therefore, we have m word patterns in total. For example, suppose we have four documents d1, d2, d3, and d4 belonging to c1, c1, c2, and c2, respectively. Let the occurrences of w1 in these documents be 1, 2, 3, and 4, respectively. Then, the word pattern x1 of w1 is:

P(c1|w1) = (1×1 + 2×1 + 3×0 + 4×0) / (1+2+3+4) = 0.3,
P(c2|w1) = (1×0 + 2×0 + 3×1 + 4×1) / (1+2+3+4) = 0.7,
x1 = <0.3, 0.7>.    (10)

It is these word patterns that our clustering algorithm works on. Our goal is to group the words in W into clusters based on these word patterns. A cluster contains a certain number of word patterns and is characterized by the product of p one-dimensional Gaussian functions. Gaussian functions are adopted because of their superiority over other functions in performance [34], [35]. Let G be a cluster containing q word patterns x1, x2, ..., xq. Let xj = <xj1, xj2, ..., xjp>, 1 ≤ j ≤ q. Then the mean m = <m1, m2, ..., mp> and the deviation σ = <σ1, σ2, ..., σp> of G are defined as

mi = ( Σ_{j=1}^{q} xji ) / |G|    (11)

σi = sqrt( Σ_{j=1}^{q} (xji − mi)^2 / |G| )    (12)

for 1 ≤ i ≤ p, where |G| denotes the size of G, i.e., the number of word patterns contained in G. The fuzzy similarity of a word pattern x = <x1, x2, ..., xp> to cluster G is defined by the following membership function:

μG(x) = Π_{i=1}^{p} exp( −((xi − mi)/σi)^2 ).    (13)

Notice that 0 ≤ μG(x) ≤ 1. A word pattern close to the mean of a cluster is regarded as very similar to this cluster, i.e., μG(x) ≈ 1. On the contrary, a word pattern far distant from a cluster is hardly similar to this cluster, i.e., μG(x) ≈ 0. For example, suppose G1 is an existing cluster with a mean vector m1 = <0.4, 0.6> and a deviation vector σ1 = <0.2, 0.3>. The fuzzy similarity of the word pattern x1 shown in Eq.(10) to cluster G1 becomes

μG1(x1) = exp( −((0.3−0.4)/0.2)^2 ) × exp( −((0.7−0.6)/0.3)^2 )
        = 0.7788 × 0.8948 = 0.6969.    (14)
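To make Eqs.(7)-(9) and the membership function of Eq.(13) concrete, here is a minimal sketch in Python with NumPy (the paper's own experiments in Section 5 were coded in MATLAB). The function names word_patterns and membership are ours; the sketch simply reproduces the numbers of Eq.(10) and Eq.(14).

import numpy as np

def word_patterns(doc_term, doc_class, n_classes):
    # Eqs.(7)-(9): pattern x_i = <P(c_1|w_i), ..., P(c_p|w_i)>, the fraction
    # of the total occurrences of word w_i that fall in each class.
    n_docs = doc_term.shape[0]
    delta = np.zeros((n_docs, n_classes))
    delta[np.arange(n_docs), doc_class] = 1.0            # delta_qj of Eq.(9)
    return (doc_term.T @ delta) / doc_term.sum(axis=0)[:, None]

def membership(x, mean, dev):
    # Eq.(13): product of p one-dimensional Gaussians.
    return float(np.exp(-(((x - mean) / dev) ** 2)).prod())

# the running example: w_1 occurs 1, 2, 3, 4 times in d_1..d_4 of classes c1, c1, c2, c2
x1 = word_patterns(np.array([[1], [2], [3], [4]]), np.array([0, 0, 1, 1]), 2)[0]
print(x1)                                                          # [0.3 0.7], as in Eq.(10)
print(membership(x1, np.array([0.4, 0.6]), np.array([0.2, 0.3])))  # about 0.6969, as in Eq.(14)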

3.1 Self-Constructing Clustering

Our clustering algorithm is an incremental, self-constructing learning approach. Word patterns are considered one by one. The user does not need to have any idea about the number of clusters in advance. No clusters exist at the beginning, and clusters can be created if necessary. For each word pattern, the similarity of this word pattern to each existing cluster is calculated to decide whether it is combined into an existing cluster or a new cluster is created. Once a new cluster is created, the corresponding membership function should be initialized. On the contrary, when the word pattern is combined into an existing cluster, the membership function of that cluster should be updated accordingly.

Let k be the number of currently existing clusters. The clusters are G1, G2, ..., Gk, respectively. Each cluster Gj has mean mj = <mj1, mj2, ..., mjp> and deviation σj = <σj1, σj2, ..., σjp>. Let Sj be the size of cluster Gj. Initially, we have k = 0, so no clusters exist at the beginning. For each word pattern xi = <xi1, xi2, ..., xip>, 1 ≤ i ≤ m, we calculate, according to Eq.(13), the similarity of xi to each existing cluster, i.e.,

μGj(xi) = Π_{q=1}^{p} exp( −((xiq − mjq)/σjq)^2 )    (15)

for 1 ≤ j ≤ k. We say that xi passes the similarity test on cluster Gj if

μGj(xi) ≥ ρ    (16)

where ρ, 0 ≤ ρ ≤ 1, is a predefined threshold. If the user intends to have larger clusters, then he/she can give a smaller threshold. Otherwise, a bigger threshold can be given. As the threshold increases, the number of clusters also increases. Note that, as usual, the power in Eq.(15) is two [34], [35]. Its value has an effect on the number of clusters obtained. A larger value will make the boundaries of the Gaussian function sharper, and more clusters will be obtained for a given threshold. On the contrary, a smaller value will make the boundaries of the Gaussian function smoother, and fewer clusters will be obtained instead.

Two cases may occur. Firstly, there are no existing fuzzy clusters on which xi has passed the similarity test. For this case, we assume that xi is not similar enough to any existing cluster, and a new cluster Gh, h = k + 1, is created with

mh = xi,   σh = σ0    (17)

where σ0 = <σ0, ..., σ0> is a user-defined constant vector. Note that the new cluster Gh contains only one member, the word pattern xi, at this point. Estimating the deviation of a cluster by Eq.(12) is impossible or inaccurate if the cluster contains few members. In particular, the deviation of a new cluster is 0 since it contains only one member. We cannot use zero deviation in the calculation of fuzzy similarities. Therefore, we initialize the deviation of a newly created cluster by σ0, as indicated in Eq.(17). Of course, the number of clusters is increased by 1 and the size of cluster Gh, Sh, should be initialized, i.e.,

k = k + 1,   Sh = 1.    (18)

Secondly, if there are existing clusters on which xi has passed the similarity test, let cluster Gt be the cluster with the largest membership degree, i.e.,

t = arg max_{1≤j≤k} ( μGj(xi) ).    (19)

In this case, we regard xi to be most similar to cluster Gt, and mt and σt of cluster Gt should be modified to include xi as its member. The modification to cluster Gt is described as follows:

mtj = ( St × mtj + xij ) / ( St + 1 ),    (20)
σtj = sqrt( A − B ) + σ0,    (21)
A = ( (St − 1)(σtj − σ0)^2 + St × mtj^2 + xij^2 ) / St,    (22)
B = ( (St + 1) / St ) ( (St × mtj + xij) / (St + 1) )^2,    (23)

for 1 ≤ j ≤ p, and

St = St + 1.    (24)

Eqs.(20)-(21) can be derived easily from Eqs.(11)-(12). Note that k is not changed in this case. Let's give an example, following Eq.(10) and Eq.(14). Suppose x1 is most similar to cluster G1, the size of cluster G1 is 4, and the initial deviation σ0 is 0.1. Then cluster G1 is modified as follows:

m11 = (4×0.4 + 0.3) / (4+1) = 0.38,
m12 = (4×0.6 + 0.7) / (4+1) = 0.62,
m1 = <0.38, 0.62>,
A11 = ( (4 − 1)(0.2 − 0.1)^2 + 4×0.38^2 + 0.3^2 ) / 4,
B11 = ( (4 + 1)/4 ) ( (4×0.38 + 0.3)/(4+1) )^2,
σ11 = sqrt(A11 − B11) + 0.1 = 0.1937,
A12 = ( (4 − 1)(0.3 − 0.1)^2 + 4×0.62^2 + 0.7^2 ) / 4,
B12 = ( (4 + 1)/4 ) ( (4×0.62 + 0.7)/(4+1) )^2,
σ12 = sqrt(A12 − B12) + 0.1 = 0.2769,
σ1 = <0.1937, 0.2769>,
S1 = 4 + 1 = 5.

The above process is iterated until all the word patterns have been processed. Consequently, we have k clusters. The whole clustering algorithm can be summarized below.

Initialization:
  # of original word patterns: m
  # of classes: p
  Threshold: ρ
  Initial deviation: σ0
  Initial # of clusters: k = 0
Input:
  xi = <xi1, xi2, ..., xip>, 1 ≤ i ≤ m
Output:
  Clusters G1, G2, ..., Gk

procedure Self-Constructing-Clustering-Algorithm
  for each word pattern xi, 1 ≤ i ≤ m
    temp_W = { Gj | μGj(xi) ≥ ρ, 1 ≤ j ≤ k };
    if (temp_W == ∅)
      A new cluster Gh, h = k + 1, is created by Eqs.(17)-(18);
    else
      let Gt ∈ temp_W be the cluster to which xi is closest by Eq.(19);
      Incorporate xi into Gt by Eqs.(20)-(24);
    endif;
  endfor;
  return with the created k clusters;
endprocedure

Note that the word patterns in a cluster have a high degree of similarity to each other. Besides, when new training patterns are considered, the existing clusters can be adjusted or new clusters can be created, without the necessity of generating the whole set of clusters from scratch.
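The following Python sketch mirrors the clustering procedure above. It is a compact reading of Eqs.(13) and (16)-(24), not the authors' code (which was written in MATLAB); the class and method names are ours, and a small max(·, 0) guard is added against round-off before taking the square root in Eq.(21).

import numpy as np

class SelfConstructingClustering:
    # A compact sketch of the procedure above (Eqs.(13), (16)-(24)).
    def __init__(self, sigma0, rho):
        self.sigma0 = sigma0          # initial deviation of a new cluster
        self.rho = rho                # similarity threshold of Eq.(16)
        self.means, self.devs, self.sizes = [], [], []

    def membership(self, x):
        # Eq.(13)/(15): product of p one-dimensional Gaussians, per cluster.
        if not self.means:
            return np.empty(0)
        m, s = np.array(self.means), np.array(self.devs)
        return np.exp(-(((x - m) / s) ** 2)).prod(axis=1)

    def feed(self, x):
        x = np.asarray(x, dtype=float)
        mu = self.membership(x)
        if mu.size == 0 or mu.max() < self.rho:            # no cluster passes Eq.(16)
            self.means.append(x.copy())                    # Eq.(17)
            self.devs.append(np.full(len(x), self.sigma0))
            self.sizes.append(1)                           # Eq.(18)
        else:
            t = int(mu.argmax())                           # Eq.(19)
            S, s = self.sizes[t], self.devs[t]
            new_m = (S * self.means[t] + x) / (S + 1)      # Eq.(20)
            A = ((S - 1) * (s - self.sigma0) ** 2
                 + S * new_m ** 2 + x ** 2) / S            # Eq.(22)
            B = (S + 1) / S * ((S * new_m + x) / (S + 1)) ** 2   # Eq.(23)
            self.devs[t] = np.sqrt(np.maximum(A - B, 0.0)) + self.sigma0  # Eq.(21)
            self.means[t] = new_m
            self.sizes[t] = S + 1                          # Eq.(24)

    def fit(self, patterns):
        # Note: the order in which patterns are fed matters; see the
        # ordering heuristic discussed in the text below.
        for x in patterns:
            self.feed(x)
        return self

Feeding the word pattern <0.3, 0.7> of Eq.(10) into a cluster with mean <0.4, 0.6>, deviation <0.2, 0.3>, size 4, and σ0 = 0.1 reproduces the update to <0.38, 0.62> and <0.1937, 0.2769> worked out above.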

Note that the order in which the word patterns are fed in influences the clusters obtained. We apply a heuristic to determine the order. We sort all the patterns, in decreasing order, by their largest components. Then the word patterns are fed in this order. In this way, more significant patterns are fed in first and are likely to become the cores of the underlying clusters. For example, let x1 = <0.1, 0.3, 0.6>, x2 = <0.3, 0.3, 0.4>, and x3 = <0.8, 0.1, 0.1> be three word patterns. The largest components in these word patterns are 0.6, 0.4, and 0.8, respectively. The sorted list is 0.8, 0.6, 0.4, so the order of feeding is x3, x1, x2. This heuristic seems to work well.

We discuss briefly here the computational cost of our method and compare it with DC [27], IOC [14], and IG [10]. For an input pattern, we have to calculate the similarity between the input pattern and every existing cluster. Each pattern consists of p components, where p is the number of classes in the document set. Therefore, in the worst case, the time complexity of our method is O(mkp), where m is the number of original features and k is the number of clusters finally obtained. For DC, the complexity is O(mkpt), where t is the number of iterations to be done. The complexity of IG is O(mp + m log m), and the complexity of IOC is O(mkpn), where n is the number of documents involved. Apparently, IG is the quickest one. Our method is better than DC and IOC.

3.2 Feature Extraction

Formally, feature extraction can be expressed in the following form:

D' = DT    (25)

where

D = [d1 d2 ··· dn]^T,    (26)
D' = [d1' d2' ··· dn']^T,    (27)
T = [ t11 ... t1k ; t21 ... t2k ; ... ; tm1 ... tmk ],    (28)

with

di = [di1 di2 ··· dim],
di' = [di1' di2' ··· dik']

for 1 ≤ i ≤ n. Clearly, T is a weighting matrix. The goal of feature reduction is achieved by finding an appropriate T such that k is smaller than m. In the divisive information-theoretic feature clustering algorithm [27] described in Section 2.2, the elements of T in Eq.(25) are binary and can be defined as follows:

tij = 1, if wi ∈ Wj; 0, otherwise    (29)

where 1 ≤ i ≤ m and 1 ≤ j ≤ k. That is, if a word wi belongs to cluster Wj, tij is 1; otherwise tij is 0.

By applying our clustering algorithm, word patterns have been grouped into clusters, and the words in the feature vector W are also clustered accordingly. For one cluster, we have one extracted feature. Since we have k clusters, we have k extracted features. The elements of T are derived based on the obtained clusters, and feature extraction is then done. We propose three weighting approaches: hard, soft, and mixed. In the hard-weighting approach, each word is only allowed to belong to one cluster, and so it contributes to only one new extracted feature. In this case, the elements of T in Eq.(25) are defined as follows:

tij = 1, if j = arg max_{1≤α≤k} ( μGα(xi) ); 0, otherwise.    (30)

Note that if j is not unique in Eq.(30), one of them is chosen randomly. In the soft-weighting approach, each word is allowed to contribute to all new extracted features, with the degrees depending on the values of the membership functions. The elements of T in Eq.(25) are defined as follows:

tij = μGj(xi).    (31)

The mixed-weighting approach is a combination of the hard-weighting approach and the soft-weighting approach. For this case, the elements of T in Eq.(25) are defined as follows:

tij = γ × t^H_ij + (1 − γ) × t^S_ij    (32)

where t^H_ij is obtained by Eq.(30), t^S_ij is obtained by Eq.(31), and γ is a user-defined constant lying between 0 and 1. Note that γ is not related to the clustering; it concerns the merging of the component features in a cluster into a resulting feature. The merge can be 'hard' or 'soft' by setting γ to 1 or 0. By selecting the value of γ, we provide flexibility to the user. When the similarity threshold is small, the number of clusters is small and each cluster covers more training patterns. In this case, a smaller γ will favor soft-weighting and get a higher accuracy. On the contrary, when the similarity threshold is large, the number of clusters is large and each cluster covers fewer training patterns. In this case, a larger γ will favor hard-weighting and get a higher accuracy.
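As an illustration of this subsection, the sketch below builds the weighting matrix T of Eq.(25) from the m × k matrix of cluster memberships μGj(xi) and applies D' = DT. It is a schematic rendering in Python/NumPy rather than the authors' implementation; the function names are ours, and ties in Eq.(30) are broken by taking the first maximum instead of a random one.

import numpy as np

def weighting_matrix(memberships, mode="mixed", gamma=0.5):
    # memberships: m-by-k array whose (i, j) entry is mu_{G_j}(x_i).
    soft = memberships                                   # Eq.(31)
    hard = np.zeros_like(memberships)
    hard[np.arange(memberships.shape[0]),
         memberships.argmax(axis=1)] = 1.0               # Eq.(30), first max on ties
    if mode == "hard":
        return hard
    if mode == "soft":
        return soft
    return gamma * hard + (1.0 - gamma) * soft           # Eq.(32)

def reduce_documents(doc_term, memberships, mode="mixed", gamma=0.5):
    # Eq.(25): D' = D T, mapping m original features to k extracted ones.
    return doc_term @ weighting_matrix(memberships, mode, gamma)

With γ = 0.8 and the memberships of the Section 4 example, the mixed mode corresponds to the matrix TM of Table 5; for instance, the row of w1 gives 0.8×(0, 1, 0) + 0.2×(0.0003, 0.9661, 0.1682) ≈ (0.0001, 0.9932, 0.0336).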

3.3 Text Classification

Given a set D of training documents, text classification can be done as follows. We specify the similarity threshold ρ for Eq.(16) and apply our clustering algorithm. Assume that k clusters are obtained for the words in the feature vector W. Then we find the weighting matrix T and convert D to D' by Eq.(25). Using D' as training data, a classifier based on support vector machines (SVM) is built. Note that any classifying technique other than SVM can be applied. Joachims [36] showed that SVM is better than other methods for text categorization. SVM is a kernel method which finds the maximum-margin hyperplane in feature space separating the images of the training patterns into two groups [37], [38], [39]. To make the method more flexible and robust, some patterns need not be correctly classified by the hyperplane, but the misclassified patterns should be penalized. Therefore, slack variables ξi are introduced to account for misclassifications. The objective function and constraints of the classification problem can be formulated as:

min_{w,b} (1/2) w^T w + C Σ_{i=1}^{l} ξi    (33)
s.t. yi ( w^T φ(xi) + b ) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, 2, ..., l

where l is the number of training patterns, C is a parameter which gives a tradeoff between maximum margin and classification error, and yi, being +1 or −1, is the target label of pattern xi. Note that φ : X → F is a mapping from the input space to the feature space F, where patterns are more easily separated, and w^T φ(xi) + b = 0 is the hyperplane to be derived, with w and b being the weight vector and offset, respectively.

An SVM described above can only separate two classes, yi = +1 and yi = −1. We follow the idea in [36] to construct an SVM-based classifier. For p classes, we create p SVMs, one SVM for each class. For the SVM of class cv, 1 ≤ v ≤ p, the training patterns of class cv are treated as having yi = +1 and the training patterns of the other classes are treated as having yi = −1. The classifier is then the aggregation of these SVMs. Now we are ready for classifying unknown documents. Suppose d is an unknown document. We first convert d to d' by

d' = dT.    (34)

Then we feed d' to the classifier. We get p values, one from each SVM. Then d belongs to those classes with 1 appearing at the outputs of their corresponding SVMs. For example, consider a case of three classes c1, c2, and c3. If the three SVMs output 1, −1, and 1, respectively, then the predicted classes will be c1 and c3 for this document. If the three SVMs output −1, 1, and 1, respectively, the predicted classes will be c2 and c3.

4 AN EXAMPLE

We give an example here to illustrate how our method works. Let D be a simple document set, containing 9 documents d1, d2, ..., d9 of two classes c1 and c2, with 10 words 'office', 'building', ..., 'fridge' in the feature vector W, as shown in Table 1. For simplicity, we denote the ten words as w1, w2, ..., w10, respectively.

We calculate the ten word patterns x1, x2, ..., x10 according to Eq.(7) and Eq.(8). For example, x6 = <P(c1|w6), P(c2|w6)>, and P(c2|w6) is calculated by Eq.(35). The resulting word patterns are shown in Table 2. Note that each word pattern is a two-dimensional vector since there are two classes involved in D.

We run our self-constructing clustering algorithm, by setting σ0 = 0.5 and ρ = 0.64, on the word patterns and obtain 3 clusters G1, G2, and G3, which are shown in Table 3. The fuzzy similarity of each word pattern to each cluster is shown in Table 4. The weighting matrices TH, TS, and TM obtained by hard-weighting, soft-weighting, and mixed-weighting (with γ = 0.8), respectively, are shown in Table 5. The transformed data sets DH, DS, and DM obtained by Eq.(25) for the different cases of weighting are shown in Table 6.

TABLE 3
Three clusters obtained.
cluster   size S   mean m              deviation σ
G1        3        <1, 0>              <0.5, 0.5>
G2        5        <0.08, 0.92>        <0.6095, 0.6095>
G3        2        <0.5833, 0.4167>    <0.6179, 0.6179>

Based on DH, DS, or DM, a classifier with two SVMs is built. Suppose d is an unknown document and d = <0, 1, 1, 1, 1, 1, 0, 1, 1, 1>. We first convert d to d' by Eq.(34). Then the transformed document is obtained as dH' = dTH = <2, 4, 2>, dS' = dTS = <2.5591, 4.3478, 3.9964>, or dM' = dTM = <2.1118, 4.0696, 2.3993>. Then the transformed unknown document is fed to the classifier. For this example, the classifier concludes that d belongs to c2.

5 EXPERIMENTAL RESULTS

In this section, we present experimental results to show the effectiveness of our fuzzy self-constructing feature clustering method. Three well-known datasets for text classification research, 20 Newsgroups [1], RCV1 [40], and Cade12 [41], are used in the following experiments. We compare our method with three other feature reduction methods: IG [10], IOC [14], and DC [27]. As described in Section 2, IG is one of the state-of-the-art feature selection approaches, IOC is an incremental feature extraction algorithm, and DC is a feature clustering approach. For convenience, our method is abbreviated as FFC, standing for Fuzzy Feature Clustering, in this section. In our method, σ0 is set to 0.25. The value was determined by investigation, and σ0 is kept at the constant 0.25 for all the following experiments. Furthermore, we use H-FFC, S-FFC, and M-FFC to represent hard-weighting, soft-weighting, and mixed-weighting, respectively. We run a classifier built on support vector machines [42], as described in Section 3.3, on the extracted features obtained by the different methods. We use a computer with an Intel(R) Core(TM)2 Quad CPU Q6600 2.40GHz and 4GB of RAM to conduct the experiments. The programming language used is MATLAB 7.0.

To compare the classification effectiveness of each method, we adopt the performance measures in terms of microaveraged precision (MicroP), microaveraged recall (MicroR), microaveraged F1 (MicroF1), and microaveraged accuracy (MicroAcc) [4], [43], [44], defined as follows:

MicroP = Σ_{i=1}^{p} TPi / Σ_{i=1}^{p} (TPi + FPi)    (36)
MicroR = Σ_{i=1}^{p} TPi / Σ_{i=1}^{p} (TPi + FNi)    (37)
MicroF1 = 2 MicroP × MicroR / (MicroP + MicroR)    (38)
MicroAcc = Σ_{i=1}^{p} (TPi + TNi) / Σ_{i=1}^{p} (TPi + TNi + FPi + FNi)    (39)

where p is the number of classes. TPi (true positives wrt ci) is the number of ci test documents that are correctly classified to ci. TNi (true negatives wrt ci) is the number of non-ci test documents that are classified to non-ci. FPi (false positives wrt ci) is the number of non-ci test documents that are incorrectly classified to ci. FNi (false negatives wrt ci) is the number of ci test documents that are classified to non-ci.
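The following small Python helper restates Eqs.(36)-(39); it is included only to make the measures explicit, with the per-class counts supplied as arrays (the names are ours).

import numpy as np

def micro_metrics(tp, tn, fp, fn):
    # Eqs.(36)-(39): micro-averaged precision, recall, F1, and accuracy
    # from per-class counts TP_i, TN_i, FP_i, FN_i (arrays of length p).
    TP, TN, FP, FN = (float(np.sum(v)) for v in (tp, tn, fp, fn))
    micro_p = TP / (TP + FP)
    micro_r = TP / (TP + FN)
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
    micro_acc = (TP + TN) / (TP + TN + FP + FN)
    return micro_p, micro_r, micro_f1, micro_acc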

TABLE 1
A simple document set D.
office building line floor bedroom kitchen apartment internet WC fridge
(w1 ) (w2 ) (w3 ) (w4 ) (w5 ) (w6 ) (w7 ) (w8 ) (w9 ) (w10 ) class
d1 0 1 0 0 1 1 0 0 0 1 c1
d2 0 0 0 0 0 2 1 1 0 0 c1
d3 0 0 0 0 0 0 1 0 0 0 c1
d4 0 0 1 0 2 1 2 1 0 1 c1
d5 0 0 0 1 0 1 0 0 1 0 c2
d6 2 1 1 0 0 1 0 0 1 0 c2
d7 3 2 1 3 0 1 0 1 1 0 c2
d8 1 0 1 1 0 1 0 0 0 0 c2
d9 1 1 1 1 0 0 0 0 0 0 c2

P(c2|w6) = (1×0 + 2×0 + 0×0 + 1×0 + 1×1 + 1×1 + 1×1 + 1×1 + 0×1) / (1+2+0+1+1+1+1+1+0) = 0.50.    (35)

TABLE 2
Word patterns of W.
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
0.00 0.20 0.20 0.00 1.00 0.50 1.00 0.67 0.00 1.00
1.00 0.80 0.80 1.00 0.00 0.50 0.00 0.33 1.00 0.00

TABLE 4
Fuzzy similarities of word patterns to three clusters.
similarity x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
μG1 (x) 0.0003 0.0060 0.0060 0.0003 1.0000 0.1353 1.0000 0.4111 0.0003 1.0000
μG2 (x) 0.9661 0.9254 0.9254 0.9661 0.0105 0.3869 0.0105 0.1568 0.9661 0.0105
μG3 (x) 0.1682 0.4631 0.4631 0.1682 0.4027 0.9643 0.4027 0.9643 0.1682 0.4027

5.1 Experiment 1: 20 Newsgroups Dataset

The 20 Newsgroups collection contains about 20,000 articles taken from the Usenet newsgroups. These articles are evenly distributed over 20 classes, and each class has about 1,000 articles, as shown in Figure 1(a). In this figure, the x-axis indicates the class number and the y-axis indicates the fraction of the articles of each class. We use two-thirds of the documents for training and the rest for testing. After preprocessing, we have 25,718 features, or words, for this data set.

Figure 2 shows the execution time (sec) of the different feature reduction methods on the 20 Newsgroups dataset. Since H-FFC, S-FFC, and M-FFC have the same clustering phase, they have the same execution time, and thus we use FFC to denote them in Figure 2. In this figure, the horizontal axis indicates the number of extracted features. To obtain different numbers of extracted features, different values of ρ are used in FFC. The number of extracted features is identical to the number of clusters. For the other methods, the number of extracted features should be specified in advance by the user. Table 7 lists values of certain points in Figure 2. The different values of ρ used in FFC are listed in the table. Note that for IG, each word is given a weight. The words of top k weights in W are selected as the extracted features in W'. Therefore, the execution time is basically the same for any value of k. Obviously, our method runs much faster than DC and IOC. For example, our method needs 8.61 seconds for 20 extracted features, while DC requires 88.38 seconds and IOC requires 6943.40 seconds. For 84 extracted features, our method only needs 17.68 seconds, but DC and IOC require 293.98 and 28098.05 seconds, respectively. As the number of extracted features increases, DC and IOC become significantly slower. In particular, when the number of extracted features exceeds 280, IOC spends more than 100,000 seconds without getting finished, as indicated by dashes in Table 7.

Fig. 2. Execution time (sec) of different methods on 20 Newsgroups data. [Figure: execution time versus number of extracted features for FFC, DC, IG, and IOC.]

Figure 3 shows the MicroAcc (%) of the 20 Newsgroups data set obtained by the different methods, based on the extracted features previously obtained. The vertical axis indicates the MicroAcc values (%) and the horizontal axis indicates the number of extracted features. Table 8 lists values of certain points in Figure 3. Note that γ is set to 0.5 for M-FFC. Also, no accuracies are listed for IOC when the number of extracted features exceeds 280. As shown in the figure, IG performs the worst in classification accuracy, especially when the number of extracted features is small. For example, IG only gets 95.79% in accuracy for 20 extracted features. S-FFC works very well when the number of extracted features is smaller than 58.

TABLE 5
Weighting matrices: hard TH, soft TS, and mixed TM (rows: words w1-w10; columns: clusters G1-G3).
        TH          TS                          TM
w1    0 1 0    0.0003 0.9661 0.1682    0.0001 0.9932 0.0336
w2    0 1 0    0.0060 0.9254 0.4631    0.0012 0.9851 0.0926
w3    0 1 0    0.0060 0.9254 0.4631    0.0012 0.9851 0.0926
w4    0 1 0    0.0003 0.9661 0.1682    0.0001 0.9932 0.0336
w5    1 0 0    1.0000 0.0105 0.4027    1.0000 0.0021 0.0805
w6    0 0 1    0.1353 0.3869 0.9643    0.0271 0.0774 0.9929
w7    1 0 0    1.0000 0.0105 0.4027    1.0000 0.0021 0.0805
w8    0 0 1    0.4111 0.1568 0.9643    0.0822 0.0314 0.9929
w9    0 1 0    0.0003 0.9661 0.1682    0.0001 0.9932 0.0336
w10   1 0 0    1.0000 0.0105 0.4027    1.0000 0.0021 0.0805

TABLE 6
Transformed data sets: hard DH, soft DS, and mixed DM (rows: documents d1-d9; columns: the three extracted features w1'-w3').
        DH              DS                              DM
d1    2  1  1    2.1413  1.3333  2.2327    2.0283  1.0667  1.2465
d2    1  0  3    1.6818  0.9411  3.2955    1.1364  0.1882  3.0591
d3    1  0  0    1.0000  0.0105  0.4027    1.0000  0.0021  0.0805
d4    5  1  2    5.5524  1.5217  4.4051    5.1105  1.1043  2.4810
d5    0  2  1    0.1360  2.3192  1.3006    0.0272  2.0638  1.0601
d6    0  5  1    0.1483  5.1362  2.3949    0.0297  5.0272  1.2790
d7    0 10  2    0.5667 10.0829  4.4950    0.1133 10.0166  2.4990
d8    0  3  1    0.1420  3.2446  1.7637    0.0284  3.0489  1.1527
d9    0  4  0    0.0126  3.7831  1.2625    0.0025  3.9566  0.2525

Fig. 1. Class distributions of three datasets: (a) 20 Newsgroups, (b) RCV1, (c) Cade12. [Figure: for each dataset, a bar chart of the ratio of each class to all documents.]

TABLE 7
Sampled execution times (sec) of different methods on 20 Newsgroups data.
# of extracted features 20 58 84 120 203 280 521 1187 1453
threshold (ρ) (0.01) (0.02) (0.03) (0.06) (0.12) (0.19) (0.23) (0.32) (0.36)
IG 19.98 19.98 19.98 19.98 19.98 19.98 19.98 19.98 19.98
DC 88.38 204.90 293.98 486.57 704.24 972.69 1973.3 3425.04 5012.79
IOC 6943.40 19397.91 28098.05 39243.00 67513.52 93010.60 ——— ——— ———
FFC 8.61 13.95 17.68 23.44 39.36 55.30 79.79 155.99 185.24

For example, S-FFC gets 98.46% in accuracy for 20 extracted features. H-FFC and M-FFC perform well in accuracy all the time, except for the case of 20 extracted features. For example, H-FFC and M-FFC get 98.43% and 98.54%, respectively, in accuracy for 58 extracted features, 98.63% and 98.56% for 203 extracted features, and 98.65% and 98.59% for 521 extracted features. Although the number of extracted features is affected by the threshold value, the classification accuracies obtained keep fairly stable. DC and IOC perform a little bit worse in accuracy when the number of extracted features is small. For example, they get 97.73% and 97.09%, respectively, in accuracy for 20 extracted features. Note that when the whole set of 25,718 features is used for classification, the accuracy is 98.45%.

Table 9 lists values of the MicroP, MicroR, and MicroF1 (%) of the 20 Newsgroups dataset obtained by the different methods. Note that γ is set to 0.5 for M-FFC. From this table, we can see that S-FFC can, in general, get the best results for MicroF1, followed by M-FFC, H-FFC, and DC in order. For MicroP, H-FFC performs a little bit better than S-FFC. But for MicroR, S-FFC performs better than H-FFC by a greater margin than in the MicroP case. M-FFC performs well for MicroP, MicroR, and MicroF1. For MicroF1, M-FFC performs only a little bit worse than S-FFC, with almost identical performance when the number of features is less than 500.

TABLE 8
Microaveraged accuracy (%) of different methods for 20 Newsgroups data.
# of extracted features 20 58 84 120 203 280 521 1187 1453
threshold (ρ) (0.01) (0.02) (0.03) (0.06) (0.12) (0.19) (0.23) (0.32) (0.36)
microaveraged IG 95.79 96.20 96.32 96.49 96.81 96.94 97.33 97.86 97.95
accuracy DC 97.73 98.19 98.38 98.36 98.43 98.56 98.28 98.63 98.26
(MicroAcc) IOC 97.09 97.44 97.50 97.51 97.68 97.62 —- —- —-
H-FFC 97.94 98.43 98.44 98.41 98.63 98.59 98.65 98.61 98.40
S-FFC 98.46 98.57 98.62 98.59 98.64 98.69 98.69 98.70 98.72
M-FFC 98.10 98.54 98.57 98.58 98.56 98.60 98.59 98.60 98.62
Full features (25718 features): MicroAcc = 98.45

TABLE 9
Microaveraged Precision, Recall, and F1 (%) of different methods for 20 Newsgroups data.
# of extracted features 20 58 84 120 203 280 521 1187 1453
threshold (ρ) (0.01) (0.02) (0.03) (0.06) (0.12) (0.19) (0.23) (0.32) (0.36)
microaveraged IG 91.91 90.77 89.78 89.07 89.53 90.29 89.57 91.19 91.17
precision DC 88.00 90.97 91.69 91.90 93.07 92.51 93.28 93.82 94.27
(MicroP) IOC 86.56 87.07 87.52 88.41 88.22 89.28 —- —- —-
H-FFC 90.87 92.35 92.55 92.59 91.79 92.01 91.37 92.05 92.90
S-FFC 90.81 91.99 91.80 91.79 91.65 92.25 91.07 91.42 91.80
M-FFC 89.65 91.69 91.92 92.15 92.34 92.71 92.58 92.56 92.90
microaveraged IG 17.75 27.11 30.22 34.16 41.26 43.72 52.96 63.45 65.46
recall DC 62.88 70.80 74.38 73.76 74.26 77.54 77.91 77.70 77.85
(MicroR) IOC 48.92 56.87 57.86 57.48 61.56 59.30 —- —- —-
H-FFC 65.21 74.67 74.96 74.27 79.84 78.73 80.61 78.95 78.43
S-FFC 76.98 78.16 79.40 78.72 80.05 80.58 81.84 81.62 81.70
M-FFC 70.25 77.71 78.32 78.32 77.76 78.17 78.11 78.36 78.42
microaveraged IG 29.76 41.77 45.23 49.39 56.50 58.92 66.57 74.85 76.22
F1 DC 73.36 79.64 82.15 81.85 82.62 84.38 84.91 85.02 85.28
(MicroF1) IOC 62.53 68.82 69.68 67.66 72.53 71.28 —- —- —-
H-FFC 75.93 82.59 82.84 82.44 85.40 84.85 85.65 85.00 85.05
S-FFC 83.34 84.53 85.17 84.77 85.46 86.04 86.22 86.25 86.46
M-FFC 78.77 84.93 85.03 85.54 85.72 85.67 85.59 84.87 85.05
Full features (25718 features): MicroP = 94.53, MicroR = 73.18, MicroF1 = 82.50

Fig. 3. Microaveraged accuracy (%) of different methods for 20 Newsgroups data. [Figure: MicroAcc versus number of extracted features for H-FFC, S-FFC, M-FFC, DC, IG, and IOC.]

Fig. 4. Microaveraged F1 (%) of M-FFC with different γ values for 20 Newsgroups data. [Figure: MicroF1 versus γ for M-FFC20, M-FFC203, and M-FFC1453.]

For MicroP, M-FFC performs better than S-FFC most of the time. H-FFC performs well when the number of features is less than 200, while it performs worse than DC and M-FFC when the number of features is greater than 200. DC performs well when the number of features is greater than 200, but it performs worse when the number of features is less than 200. For example, DC gets 88.00% and M-FFC gets 89.65% when the number of features is 20. For MicroR, S-FFC performs best for all the cases. M-FFC performs well, especially when the number of features is less than 200. Note that when the whole set of 25,718 features is used for classification, MicroP is 94.53%, MicroR is 73.18%, and MicroF1 is 82.50%.

In summary, IG runs very fast in feature reduction, but the extracted features obtained are not good for classification. FFC can not only run much faster than DC and IOC in feature reduction, but also provide comparably good or better extracted features for classification. Figure 4 shows the influence of γ values on the performance of M-FFC in MicroF1 for three numbers of extracted features, 20, 203, and 1,453, indicated by M-FFC20, M-FFC203, and M-FFC1453, respectively. As shown in the figure, the MicroF1 values do not vary significantly with different settings of γ.

Fig. 5. Execution time (sec) of different methods on RCV1 data. [Figure: execution time versus number of extracted features for FFC, DC, IG, and IOC.]

Fig. 6. Microaveraged accuracy (%) of different methods for RCV1 data. [Figure: MicroAcc versus number of extracted features for H-FFC, S-FFC, M-FFC, DC, IG, and IOC.]

Fig. 7. Microaveraged F1 (%) of M-FFC with different γ values for RCV1 data. [Figure: MicroF1 versus γ for M-FFC18, M-FFC202, and M-FFC1700.]

5.2 Experiment 2: REUTERS CORPUS VOLUME 1 (RCV1) Dataset

The RCV1 dataset consists of 804,414 news stories produced by Reuters from 20 Aug 1996 to 19 Aug 1997. All news stories are in English and have 109 distinct terms per document on average. The documents are divided, by the “LYRL2004” split defined in Lewis et al. [40], into 23,149 training documents and 781,265 testing documents. There are 103 Topic categories, and the distribution of the documents over the classes is shown in Figure 1(b). All the 103 categories have one or more positive test examples in the test set, but only 101 of them have one or more positive training examples in the training set. It is these 101 categories that we use in this experiment. After preprocessing, we have 47,152 features for this data set.

Figure 5 shows the execution time (sec) of the different feature reduction methods on the RCV1 dataset. Table 10 lists values of certain points in Figure 5. Clearly, our method runs much faster than DC and IOC. For example, for 18 extracted features, DC requires 609.63 seconds and IOC requires 94533.88 seconds, but our method only needs 14.42 seconds. For 65 extracted features, our method only needs 37.40 seconds, but DC and IOC require 1709.77 and over 100,000 seconds, respectively. As the number of extracted features increases, DC and IOC become significantly slower.

Figure 6 shows the MicroAcc (%) of the RCV1 dataset obtained by the different methods, based on the extracted features previously obtained. Table 11 lists values of certain points in Figure 6. Note that γ is set to 0.5 for M-FFC. No accuracies are listed for IOC when the number of extracted features exceeds 18. As shown in the figure, IG performs the worst in classification accuracy, especially when the number of extracted features is small. For example, IG only gets 96.95% in accuracy for 18 extracted features. H-FFC, S-FFC, and M-FFC perform well in accuracy all the time. For example, H-FFC, S-FFC, and M-FFC get 98.03%, 98.26%, and 97.85%, respectively, in accuracy for 18 extracted features, 98.13%, 98.39%, and 98.31%, respectively, for 65 extracted features, and 98.41%, 98.53%, and 98.56%, respectively, for 421 extracted features. Note that when the whole set of 47,152 features is used for classification, the accuracy is 98.83%.

Table 12 lists values of the MicroP, MicroR, and MicroF1 (%) of the RCV1 dataset obtained by the different methods. From this table, we can see that S-FFC can, in general, get the best results for MicroF1, followed by M-FFC, DC, and H-FFC in order. S-FFC performs better than the other methods, especially when the number of features is less than 500. For MicroP, all the methods perform about equally well, the values being around 80%-85%. IG gets a high MicroP when the number of features is smaller than 50. M-FFC performs well for MicroP, MicroR, and MicroF1. Note that when the whole set of 47,152 features is used for classification, MicroP is 86.66%, MicroR is 75.03%, and MicroF1 is 80.43%.

In summary, IG runs very fast in feature reduction, but the extracted features obtained are not good for classification. For example, IG gets 96.95% in MicroAcc, 91.58% in MicroP, 5.80% in MicroR, and 10.90% in MicroF1 for 18 extracted features, and 97.24% in MicroAcc, 68.82% in MicroP, 25.72% in MicroR, and 37.44% in MicroF1 for 65 extracted features. FFC can not only run much faster than DC and IOC in feature reduction, but also provide comparably good or better extracted features for classification. For example, S-FFC gets 98.26% in MicroAcc, 85.18% in MicroP, 55.33% in MicroR, and 67.08% in MicroF1 for 18 extracted features, and 98.39% in MicroAcc, 82.34% in MicroP, 63.41% in MicroR, and 71.65% in MicroF1 for 65 extracted features. Figure 7 shows the influence of γ values on the performance of M-FFC in MicroF1 for three numbers of extracted features, 18, 202, and 1700, indicated by M-FFC18, M-FFC202, and M-FFC1700, respectively. As shown in the figure, the MicroF1 values do not vary significantly with different settings of γ.

TABLE 10
Sampled execution times (sec) of different methods on RCV1 data.
# of extracted features 18 42 65 74 202 421 818 1700
threshold (ρ) (0.01) (0.03) (0.07) (0.1) (0.2) (0.3) (0.4) (0.5)
IG 9.44 9.44 9.44 9.44 9.44 9.44 9.44 9.44
DC 609.63 1329.64 1709.77 1980.48 4256.1 8501.36 16592.09 21460.46
IOC 94533.88 ——— ——— ——— ——— ——— ——— ———
FFC 14.42 31.13 37.40 46.57 107.87 181.60 327.04 612.51

TABLE 11
Microaveraged Accuracy (%) of different methods for RCV1 data.
# of extracted features 18 42 65 74 202 421 818 1700
threshold (ρ) (0.01) (0.03) (0.07) (0.1) (0.2) (0.3) (0.4) (0.5)
microaveraged IG 96.95 97.13 97.24 97.26 97.60 98.03 98.37 98.69
accuracy DC 98.03 98.07 98.19 98.29 98.39 98.51 98.58 98.67
(MicroAcc) IOC 96.98 —- —- —- —- —- —- —-
H-FFC 98.03 98.02 98.13 98.14 98.31 98.41 98.49 98.54
S-FFC 98.26 98.37 98.39 98.41 98.58 98.53 98.57 98.58
M-FFC 97.85 98.28 98.31 98.38 98.49 98.56 98.58 98.62
Full features (47152 features): MicroAcc = 98.83

TABLE 12
Microaveraged Precision, Recall and F1 (%) of different methods for RCV1 data.
# of extracted features 18 42 65 74 202 421 818 1700
threshold (ρ) (0.01) (0.03) (0.07) (0.1) (0.2) (0.3) (0.4) (0.5)
microaveraged IG 91.58 92.00 68.82 69.46 76.25 83.74 87.31 88.94
precision DC 79.87 79.96 82.82 83.58 83.75 85.38 86.51 87.57
(MicroP) IOC 61.89 —- —- —- —- —- —- —-
H-FFC 81.91 78.35 79.74 79.68 81.42 83.03 84.42 84.31
S-FFC 85.18 82.96 82.34 82.68 87.28 84.73 85.43 85.52
M-FFC 71.42 82.50 81.58 82.16 87.32 86.78 85.51 86.21
microaveraged IG 5.80 11.82 25.72 26.32 36.66 48.16 57.67 67.78
recall DC 51.72 53.27 54.93 58.02 61.75 64.81 66.23 68.16
(MicroR) IOC 15.68 —- —- —- —- —- —- —-
H-FFC 49.45 53.05 55.92 56.50 61.32 63.49 64.91 66.83
S-FFC 55.33 61.91 63.41 63.76 65.33 66.11 66.81 67.30
M-FFC 55.53 58.85 61.30 63.18 61.86 65.30 67.27 68.00
microaveraged IG 10.90 20.94 37.44 38.17 49.51 61.15 69.46 76.93
F1 DC 62.79 63.94 66.05 68.49 71.08 73.68 75.02 76.66
(MicroF1) IOC 25.01 —- —- —- —- —- —- —-
H-FFC 61.67 63.27 65.74 66.12 69.96 71.96 73.39 74.56
S-FFC 67.08 70.91 71.65 72.00 74.73 74.27 74.98 75.32
M-FFC 62.48 68.69 70.00 71.43 72.42 74.52 75.30 76.03
Full features (47152 features): MicroP = 86.66, MicroR = 75.03, MicroF1 = 80.43

TABLE 13
The number of documents contained in each class of the Cade12 dataset.
class            01–servicos   02–sociedade   03–lazer      04–informatica   05–saude      06–educacao
# of documents   8473          7363           5590          4519             3171          2856
class            07–internet   08–cultura     09–esportes   10–noticias      11–ciencias   12–compras-online
# of documents   2381          2137           1907          1082             879           625

5.3 Experiment 3: Cade12 Data

Cade12 is a set of classified Web pages extracted from the Cadê Web directory [41]. This directory points to Brazilian Web pages that were classified by human experts into 12 classes. The Cade12 collection has a skewed distribution, and the three most popular classes represent more than 50% of all documents. A version of this dataset, 40,983 documents in total with 122,607 features, was obtained from [45], from which two-thirds, 27,322 documents, are split for training and the remaining 13,661 documents for testing. The distribution of the Cade12 data is shown in Figure 1(c) and Table 13.

Figure 8 shows the execution time (sec) of the different feature reduction methods on the Cade12 dataset. Table 14 lists values of certain points in Figure 8. Clearly, our method runs much faster than DC and IOC. For example, for 22 extracted features, DC requires 689.63 seconds and IOC requires 57958.10 seconds, but our method only needs 24.02 seconds. For 84 extracted features, our method only needs 47.48 seconds, while DC requires 2939.35 seconds and IOC has difficulty finishing. As the number of extracted features increases, DC and IOC become significantly slower. In particular, IOC can only finish within 100,000 seconds for 22 extracted features, as shown in Table 14.

TABLE 14
Sampled execution times (sec) of different methods on Cade12 data.
# of extracted features 22 38 59 84 108 286 437 889 1338
threshold (ρ) (0.01) (0.03) (0.07) (0.14) (0.19) (0.28) (0.37) (0.48) (0.54)
IG 105.18 105.18 105.18 105.18 105.18 105.18 105.18 105.18 105.18
DC 689.63 1086.91 1871.67 2939.35 4167.23 7606.41 11593.84 23509.45 30331.91
IOC 57958.10 ——— ——— ——— ——— ——— ——— ——— ———
FFC 24.02 27.61 34.40 47.48 57.55 100.18 145.54 276.47 381.72

Fig. 8. Execution time (sec) of different methods on Cade12 data. [Figure: execution time versus number of extracted features for FFC, DC, IG, and IOC.]

Fig. 10. Microaveraged F1 (%) of M-FFC with different γ values for Cade12 data. [Figure: MicroF1 versus γ for M-FFC22, M-FFC286, and M-FFC1338.]

94
FFC can get best results, followed by M-FFC, H-FFC, and
DC in order. Note that while the whole 122,607 features
Microaveraged Accuracy %

are used for classification, MicroAcc is 93.55%, MicroP is


93
69.57%, MicroR is 40.11%, and MicroF1 is 50.88%. Again,
H−FFC
this experiment shows that FFC can not only run much faster
92 S−FFC
M−FFC
than DC and IOC in feature reduction, but also provide
DC
IG comparably good or better extracted features for classification.
IOC
91
20 50 100 200 500 1,000 122607
Figure 10 shows the influence of γ values on the performance
Number of extracted features
of M-FFC in MicroF1 for three numbers of extracted features,
22, 286, and 1,338, indicated by M-FFC22, M-FFC286, and
Fig. 9. Microaveraged accuracy (%) of different methods
M-FFC1338, respectively. As shown in the figure, the MircoF1
for Cade12 data.
values do not vary significantly with different settings of γ.

Figure 9 shows the MicroAcc (%) of the Cade12 dataset obtained by the different methods, based on the extracted features obtained previously. Table 15 lists the values of certain points in Figure 9, and Table 16 lists the MicroP, MicroR, and MicroF1 values (%) of the Cade12 dataset obtained by the different methods. Note that γ is set to 0.5 for M-FFC. Also, no accuracies are listed for IOC when the number of extracted features exceeds 22. As shown in the tables, all the methods work equally well for MicroAcc, but none of them works satisfactorily well for MicroP, MicroR, and MicroF1. In fact, feature reduction is hard for Cade12, since only loose correlation exists among the original features. IG performs best for MicroP; for example, it gets 78.21% in MicroP when the number of features is 38. However, IG performs worst for MicroR, getting only 7.04% in MicroR with the same number of features. S-FFC, M-FFC, and H-FFC are a little worse than IG in MicroP, but they are good in MicroR. For example, S-FFC gets 75.83% in MicroP and 38.33% in MicroR when the number of features is 38. In general, S-FFC gets the best results, followed by M-FFC, H-FFC, and DC, in that order. Note that when the whole set of 122,607 features is used for classification, MicroAcc is 93.55%, MicroP is 69.57%, MicroR is 40.11%, and MicroF1 is 50.88%. Again, this experiment shows that FFC not only runs much faster than DC and IOC in feature reduction, but also provides comparably good or better extracted features for classification.

Figure 10 shows the influence of the γ value on the performance of M-FFC in MicroF1 for three numbers of extracted features, 22, 286, and 1,338, indicated by M-FFC22, M-FFC286, and M-FFC1338, respectively. As shown in the figure, the MicroF1 values do not vary significantly with different settings of γ.
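As an illustration of how the micro-averaged measures reported in Tables 15 and 16 can be computed, the following Python sketch derives MicroAcc, MicroP, MicroR, and MicroF1 from pooled per-class counts over all (document, class) decisions. It is a minimal reconstruction for exposition only, not the evaluation code used in our experiments; all names are ours.

def micro_scores(y_true, y_pred, labels):
    """y_true, y_pred: lists of label sets, one set per test document."""
    tp = fp = fn = tn = 0
    for true_set, pred_set in zip(y_true, y_pred):
        for c in labels:
            in_true, in_pred = c in true_set, c in pred_set
            if in_true and in_pred:
                tp += 1
            elif in_pred:
                fp += 1
            elif in_true:
                fn += 1
            else:
                tn += 1
    micro_acc = (tp + tn) / (tp + tn + fp + fn)            # MicroAcc
    micro_p = tp / (tp + fp) if tp + fp else 0.0           # MicroP
    micro_r = tp / (tp + fn) if tp + fn else 0.0           # MicroR
    micro_f1 = (2 * micro_p * micro_r / (micro_p + micro_r)
                if micro_p + micro_r else 0.0)             # MicroF1
    return micro_acc, micro_p, micro_r, micro_f1

# Example: two documents, three classes.
print(micro_scores([{'a'}, {'b', 'c'}], [{'a'}, {'b'}], ['a', 'b', 'c']))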
TABLE 15
Microaveraged accuracy (%) of different methods for Cade12 data.

Number of extracted features:    22      38      59      84     108     286     437     889    1338
threshold (ρ):                (0.01)  (0.03)  (0.07)  (0.14)  (0.19)  (0.28)  (0.37)  (0.48)  (0.54)
IG                             91.89   92.09   92.20   92.24   92.28   92.45   92.51   92.80   92.85
DC                             93.18   93.20   93.17   93.30   93.30   93.59   93.63   93.65   93.68
IOC                            92.82     —       —       —       —       —       —       —       —
H-FFC                          93.63   93.51   93.46   93.48   93.51   93.63   93.59   93.64   93.62
S-FFC                          93.79   93.84   93.80   93.83   93.91   93.89   93.87   93.80   93.77
M-FFC                          93.75   93.63   93.54   93.66   93.72   93.89   93.91   93.94   93.95
Full features (122,607 features): MicroAcc = 93.55

TABLE 16
Microaveraged precision, recall, and F1 (%) of different methods for Cade12 data.

Number of extracted features:    22      38      59      84     108     286     437     889    1338
threshold (ρ):                (0.01)  (0.03)  (0.07)  (0.14)  (0.19)  (0.28)  (0.37)  (0.48)  (0.54)
MicroP   IG                    76.33   78.21   78.59   78.16   78.41   76.33   76.02   74.10   73.92
         DC                    70.79   69.22   68.60   69.56   68.60   72.26   72.35   73.36   73.76
         IOC                   76.26     —       —       —       —       —       —       —       —
         H-FFC                 71.16   71.31   71.63   72.31   72.93   72.89   72.38   73.67   73.03
         S-FFC                 75.10   75.83   74.54   75.60   75.37   73.87   76.07   73.27   75.38
         M-FFC                 73.78   72.94   72.58   75.12   75.23   75.86   74.58   74.49   74.47
MicroR   IG                     3.89    7.04    8.78    9.59   10.21   13.67   14.71   20.82   21.89
         DC                    30.90   33.04   33.18   34.84   36.18   37.48   38.07   37.42   37.52
         IOC                   20.06     —       —       —       —       —       —       —       —
         H-FFC                 39.59   37.05   35.63   35.33   36.66   37.46   37.29   36.82   37.08
         S-FFC                 38.08   38.33   38.85   38.40   39.98   38.82   38.53   40.35   37.51
         M-FFC                 38.74   37.37   36.06   35.80   36.71   38.86   41.83   41.47   41.72
MicroF1  IG                     7.41   12.92   15.80   17.08   18.07   23.18   24.66   32.50   33.78
         DC                    43.02   44.73   44.73   46.42   47.37   49.36   49.89   49.56   49.74
         IOC                   31.76     —       —       —       —       —       —       —       —
         H-FFC                 50.88   48.77   47.59   47.47   48.51   49.49   49.22   49.10   49.19
         S-FFC                 50.53   50.92   51.08   50.93   52.25   52.86   53.02   52.04   50.09
         M-FFC                 50.80   49.42   48.18   48.49   49.34   52.36   52.85   53.28   53.48
Full features (122,607 features): MicroP = 69.57, MicroR = 40.11, MicroF1 = 50.88

6 CONCLUSIONS

We have presented a fuzzy self-constructing feature clustering (FFC) algorithm, an incremental clustering approach for reducing the dimensionality of the feature space in text classification. Features that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with a statistical mean and deviation. If a word is not similar to any existing cluster, a new cluster is created for it. Similarity between a word and a cluster is defined by considering both the mean and the variance of the cluster. When all the words have been fed in, a desired number of clusters are formed automatically. We then have one extracted feature for each cluster; the extracted feature corresponding to a cluster is a weighted combination of the words contained in that cluster. By this algorithm, the derived membership functions match closely with, and properly describe, the real distribution of the training data. Moreover, the user need not specify the number of extracted features in advance, so trial-and-error for determining the appropriate number of extracted features can be avoided. Experiments on three real-world datasets have demonstrated that our method can run faster and obtain better extracted features than other methods.
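To make the incremental procedure concrete, the following Python sketch mimics the clustering loop summarized above: each incoming word pattern is compared against existing clusters, merged into the most similar one, or used to start a new cluster when no membership exceeds the threshold. The Gaussian-style membership, the initial deviation sigma0, and the running-statistics updates are simplifying assumptions made for illustration; they are not the exact formulas of the FFC algorithm, and all names are ours. The parameter rho plays the role of the threshold ρ listed in the tables above.

import math

class Cluster:
    def __init__(self, x, sigma0=0.25):
        self.n = 1
        self.mean = list(x)
        self.msq = [v * v for v in x]        # running mean of squares
        self.sigma0 = sigma0                  # assumed initial deviation

    def deviation(self, i):
        var = self.msq[i] - self.mean[i] ** 2
        return math.sqrt(max(var, 0.0)) + self.sigma0

    def membership(self, x):
        # Product of per-dimension Gaussian-like memberships.
        return math.prod(
            math.exp(-((x[i] - self.mean[i]) / self.deviation(i)) ** 2)
            for i in range(len(x)))

    def add(self, x):
        self.n += 1
        for i, v in enumerate(x):
            self.mean[i] += (v - self.mean[i]) / self.n
            self.msq[i] += (v * v - self.msq[i]) / self.n

def incremental_clustering(patterns, rho=0.3):
    clusters = []
    for x in patterns:
        best = max(clusters, key=lambda c: c.membership(x), default=None)
        if best is None or best.membership(x) < rho:
            clusters.append(Cluster(x))      # not similar enough: new cluster
        else:
            best.add(x)                      # merge into most similar cluster
    return clusters

# Example: four 2-dimensional word patterns.
print(len(incremental_clustering([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])))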
Other projects have addressed text classification, with or without feature reduction, and their evaluation results have been published. Some of them adopted the same performance measures as we did, i.e., microaveraged accuracy or microaveraged F1, while others adopted uni-labeled accuracy (ULAcc). Uni-labeled accuracy is basically the same as the normal accuracy, except that a pattern, either training or testing, with m class labels is copied m times, each copy being associated with a distinct class label.
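For illustration, the following Python sketch computes uni-labeled accuracy under this convention, assuming the classifier outputs a single label per pattern and that a copy counts as correct when the predicted label matches that copy's label; this scoring detail is our own reading of the definition, not taken from the cited papers.

def uni_labeled_accuracy(y_true_sets, y_pred_labels):
    """Each pattern with m true labels is copied m times, one copy per label."""
    correct = total = 0
    for true_set, pred in zip(y_true_sets, y_pred_labels):
        for label in true_set:          # one copy per distinct true label
            total += 1
            correct += (pred == label)
    return correct / total

# A document labeled {'sports', 'health'} and predicted as 'sports'
# contributes one correct copy and one incorrect copy.
print(uni_labeled_accuracy([{'sports', 'health'}, {'arts'}],
                           ['sports', 'arts']))   # -> 0.666...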
We discuss some of these methods here; the evaluation results given for them are taken directly from the corresponding papers. Joachims [33] applied the Rocchio classifier over a TFIDF-weighted representation and obtained 91.8% uni-labeled accuracy on the 20 Newsgroups dataset. Lewis et al. [40] used the SVM classifier over a form of TFIDF weighting and obtained 81.6% microaveraged F1 on the RCV1 dataset. Al-Mubaid & Umair [31] applied a classifier, called Lsquare, which combines the distributional clustering of words with a learning logic technique, and achieved 98.00% microaveraged accuracy and 86.45% microaveraged F1 on the 20 Newsgroups dataset. Cardoso-Cachopo [45] applied the SVM classifier over normalized TFIDF term weighting and achieved 52.84% uni-labeled accuracy on the Cade12 dataset and 82.48% on the 20 Newsgroups dataset. Al-Mubaid & Umair [31] applied AIB [25] for feature reduction, and the number of features was reduced to 120; no feature reduction was applied in the other three methods. The published evaluation results for these methods are summarized in Table 17.

TABLE 17
Published evaluation results for some other text classifiers.

method                   dataset   feature red.   evaluation
Joachims [33]            20 NG     no             91.80% (ULAcc)
Lewis et al. [40]        RCV1      no             81.6% (MicroF1)
Al-Mubaid & Umair [31]   20 NG     AIB [25]       98.00% (MicroAcc), 86.45% (MicroF1)
Cardoso-Cachopo [45]     20 NG     no             82.48% (ULAcc)
                         Cade12    no             52.84% (ULAcc)

Note that, according to the definitions, uni-labeled accuracy is usually lower than microaveraged accuracy for multi-label classification. This explains why both Joachims [33] and Cardoso-Cachopo [45] report lower uni-labeled accuracies than the microaveraged accuracies shown in Table 8 and Table 15. However, Lewis et al. [40] report a higher microaveraged F1 than that shown in Table 11. This may be because Lewis et al. [40] used TFIDF weighting, which is more elaborate than the TF weighting we used. It is, however, hard for us to explain the difference between the microaveraged F1 obtained by Al-Mubaid & Umair [31] and that shown in Table 9, since a kind of sampling was applied in Al-Mubaid & Umair [31].
Similarity-based clustering is one of the techniques we have developed in our machine learning research. In this paper, we apply this clustering technique to text categorization problems. We are also applying it to other problems, such as image segmentation, data sampling, fuzzy modeling, and web mining. The work of this paper was motivated by the distributional word clustering proposed in [24], [25], [26], [27], [29], [30]. We found that when a document set is transformed into a collection of word patterns, as by Eq. (7), the relevance among word patterns can be measured and the word patterns can be grouped by applying our similarity-based clustering algorithm. Our method is well suited to text categorization problems because of the suitability of the distributional word clustering concept.
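As a rough illustration of this transformation, the Python sketch below builds one pattern per word from class-conditional word counts. Since Eq. (7) is not reproduced in this excerpt, the use of estimated class probabilities P(c|w) as the word pattern is an assumption made in the spirit of distributional word clustering, and all function and variable names are ours.

from collections import Counter, defaultdict

def word_patterns(docs, labels, classes):
    """docs: list of token lists; labels: the class of each document."""
    counts = defaultdict(Counter)        # counts[word][class] = frequency
    for tokens, c in zip(docs, labels):
        for w in tokens:
            counts[w][c] += 1
    patterns = {}
    for w, by_class in counts.items():
        total = sum(by_class.values())
        # Assumed pattern: estimated class probabilities given the word.
        patterns[w] = [by_class[c] / total for c in classes]
    return patterns

docs = [['ball', 'game'], ['game', 'vote'], ['vote', 'law']]
labels = ['sports', 'politics', 'politics']
print(word_patterns(docs, labels, ['sports', 'politics']))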
ACKNOWLEDGMENT

The authors are grateful to the anonymous reviewers for their comments, which were very helpful in improving the quality and presentation of the paper. They would also like to thank the National Institute of Standards & Technology for providing the RCV1 dataset.

REFERENCES

[1] http://people.csail.mit.edu/jrennie/20Newsgroups/.
[2] http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.
[3] H. Kim, P. Howland, and H. Park, “Dimension Reduction in Text Classification with Support Vector Machines,” Journal of Machine Learning Research, vol. 6, pp. 37-53, 2005.
[4] F. Sebastiani, “Machine Learning in Automated Text Categorization,” ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[5] B. Y. Ricardo and R. N. Berthier, Modern Information Retrieval. Addison Wesley Longman, 1999.
[6] A. L. Blum and P. Langley, “Selection of Relevant Features and Examples in Machine Learning,” Artificial Intelligence, vol. 97, nos. 1-2, pp. 245-271, 1997.
[7] E. F. Combarro, E. Montañés, I. Díaz, J. Ranilla, and R. Mones, “Introducing a Family of Linear Measures for Feature Selection in Text Categorization,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 9, pp. 1223-1232, 2005.
[8] K. Daphne and M. Sahami, “Toward Optimal Feature Selection,” 13th International Conference on Machine Learning, pp. 284-292, 1996.
[9] R. Kohavi and G. John, “Wrappers for Feature Subset Selection,” Artificial Intelligence, vol. 97, nos. 1-2, pp. 273-324, 1997.
[10] Y. Yang and J. O. Pedersen, “A Comparative Study on Feature Selection in Text Categorization,” 14th International Conference on Machine Learning, pp. 412-420, 1997.
[11] D. D. Lewis, “Feature Selection and Feature Extraction for Text Categorization,” Workshop on Speech and Natural Language, pp. 212-217, 1992.
[12] H. Li, T. Jiang, and K. Zang, “Efficient and Robust Feature Extraction by Maximum Margin Criterion,” Conference on Advances in Neural Information Processing Systems, pp. 97-104, 2004.
[13] E. Oja, “Subspace Methods of Pattern Recognition,” Pattern Recognition and Image Processing Series, vol. 6, 1983.
[14] J. Yan, B. Zhang, N. Liu, S. Yan, Q. Cheng, W. Fan, Q. Yang, W. Xi, and Z. Chen, “Effective and Efficient Dimensionality Reduction for Large-Scale and Streaming Data Preprocessing,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 3, pp. 320-331, 2006.
[15] I. T. Jolliffe, Principal Component Analysis. Springer-Verlag, 1986.
[16] A. M. Martinez and A. C. Kak, “PCA versus LDA,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-233, 2001.
[17] H. Park, M. Jeon, and J. Rosen, “Lower Dimensional Representation of Text Data Based on Centroids and Least Squares,” BIT Numerical Mathematics, vol. 43, pp. 427-448, 2003.
[18] S. T. Roweis and L. K. Saul, “Nonlinear Dimensionality Reduction by Locally Linear Embedding,” Science, vol. 290, pp. 2323-2326, 2000.
[19] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A Global Geometric Framework for Nonlinear Dimensionality Reduction,” Science, vol. 290, pp. 2319-2323, 2000.
[20] M. Belkin and P. Niyogi, “Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering,” Advances in Neural Information Processing Systems 14, 2002.
[21] K. Hiraoka, K. Hidai, M. Hamahira, H. Mizoguchi, T. Mishima, and S. Yoshizawa, “Successive Learning of Linear Discriminant Analysis: Sanger-Type Algorithm,” 14th International Conference on Pattern Recognition, pp. 2664-2667, 2000.
[22] J. Weng, Y. Zhang, and W. S. Hwang, “Candid Covariance-Free Incremental Principal Component Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, pp. 1034-1040, 2003.
[23] J. Yan, B. Y. Zhang, S. C. Yan, Z. Chen, W. G. Fan, Q. Yang, W. Y. Ma, and Q. S. Cheng, “IMMC: Incremental Maximum Margin Criterion,” 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 725-730, 2004.
[24] L. D. Baker and A. McCallum, “Distributional Clustering of Words for Text Classification,” 21st Annual International ACM SIGIR, pp. 96-103, 1998.
[25] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, “Distributional Word Clusters vs. Words for Text Categorization,” Journal of Machine Learning Research, vol. 3, pp. 1183-1208, 2003.
[26] M. C. Dalmau and O. W. M. Flórez, “Experimental Results of the Signal Processing Approach to Distributional Clustering of Terms on Reuters-21578 Collection,” 29th European Conference on IR Research, pp. 678-681, 2007.
[27] I. S. Dhillon, S. Mallela, and R. Kumar, “A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification,” Journal of Machine Learning Research, vol. 3, pp. 1265-1287, 2003.
[28] D. Ienco and R. Meo, “Exploration and Reduction of the Feature Space by Hierarchical Clustering,” 2008 SIAM Conference on Data Mining, pp. 577-587, 2008.
[29] N. Slonim and N. Tishby, “The Power of Word Clusters for Text Classification,” 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
[30] F. Pereira, N. Tishby, and L. Lee, “Distributional Clustering of English Words,” 31st Annual Meeting of the ACL, pp. 183-190, 1993.
[31] H. Al-Mubaid and S. A. Umair, “A New Text Categorization Technique Using Distributional Clustering and Learning Logic,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 9, pp. 1156-1165, 2006.
[32] G. Salton and M. J. McGill, Introduction to Modern Retrieval. McGraw-Hill Book Company, 1983.
[33] T. Joachims, “A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization,” 14th International Conference on Machine Learning, pp. 143-151, 1997.
[34] J. Yen and R. Langari, Fuzzy Logic–Intelligence, Control, and Information. Prentice-Hall, Upper Saddle River, NJ, USA, 1999.
[35] J. S. Wang and C. S. G. Lee, “Self-Adaptive Neurofuzzy Inference Systems for Classification Applications,” IEEE Transactions on Fuzzy Systems, vol. 10, no. 6, pp. 790-802, 2002.
[36] T. Joachims, “Text Categorization with Support Vector Machines: Learning with Many Relevant Features,” Technical Report LS-8-23, University of Dortmund, Computer Science Department, 1998.
[37] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[38] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
[39] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
[40] D. D. Lewis, Y. Yang, T. Rose, and F. Li, “RCV1: A New Benchmark Collection for Text Categorization Research,” Journal of Machine Learning Research, vol. 5, pp. 361-397, 2004 (http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf).
[41] The Cadê Web directory, http://www.cade.com.br/.
[42] C. C. Chang and C. J. Lin, “LIBSVM: A Library for Support Vector Machines,” 2001, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[43] Y. Yang and X. Liu, “A Re-examination of Text Categorization Methods,” ACM SIGIR Conference, pp. 42-49, 1999.
[44] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Mining Multi-label Data,” Data Mining and Knowledge Discovery Handbook, draft of preliminarily accepted chapter, O. Maimon and L. Rokach (Eds.), Springer, 2nd edition, 2009.
[45] http://web.ist.utl.pt/~acardoso/datasets/.
Jung-Yi Jiang was born in Changhua, Taiwan, ROC, in 1979. He received the B.S. degree from I-Shou University, Taiwan, in 2002 and the M.S.E.E. degree from Sun Yat-Sen University, Taiwan, in 2004. He is currently pursuing the Ph.D. degree at the Department of Electrical Engineering, National Sun Yat-Sen University. His main research interests include machine learning, data mining, and information retrieval.

Ren-Jia Liou was born in Banqiao, Taipei County, Taiwan, ROC, in 1983. He received the B.S. degree from National Dong Hwa University, Taiwan, in 2005 and the M.S.E.E. degree from Sun Yat-Sen University, Taiwan, in 2009. He is currently a research assistant at the Department of Chemistry, National Sun Yat-Sen University. His main research interests include machine learning, data mining, and web programming.

Shie-Jue Lee was born at Kin-Men, ROC, on August 15, 1955. He received the B.S. and M.S. degrees in Electrical Engineering from National Taiwan University in 1977 and 1979, respectively, and the Ph.D. degree in Computer Science from the University of North Carolina, Chapel Hill, USA, in 1990. Dr. Lee joined the faculty of the Department of Electrical Engineering at National Sun Yat-Sen University, Taiwan, in 1983, and has been a professor of the department since 1994. His research interests include artificial intelligence, machine learning, data mining, information retrieval, and soft computing.
Dr. Lee served as director of the Center for Telecommunications Research and Development, National Sun Yat-Sen University (1997-2000), director of the Southern Telecommunications Research Center, National Science Council (1998-1999), and Chair of the Department of Electrical Engineering, National Sun Yat-Sen University (2000-2003). He is now serving as Deputy Dean of Academic Affairs and director of the NSYSU-III Research Center, National Sun Yat-Sen University. He received the Distinguished Teachers Award of the Ministry of Education, Taiwan, in 1993, and was awarded by the Chinese Institute of Electrical Engineering for Outstanding M.S. Thesis Supervision in 1997. He received the Distinguished Paper Award of the Computer Society of the Republic of China in 1998, and the Best Paper Award of the 7th Conference on Artificial Intelligence and Applications in 2002. He received the Distinguished Research Award of National Sun Yat-Sen University in 1998, and the Distinguished Teaching Award of National Sun Yat-Sen University in 1993 and 2008. He received the Best Paper Award of the International Conference on Machine Learning and Cybernetics in 2008, and the Distinguished Mentor Award of National Sun Yat-Sen University in 2008. He served as program chair for the International Conference on Artificial Intelligence (TAAI-96), Kaohsiung, Taiwan, December 1996; the International Computer Symposium – Workshop on Artificial Intelligence, Tainan, Taiwan, December 1998; and the 6th Conference on Artificial Intelligence and Applications, Kaohsiung, Taiwan, November 2001. Dr. Lee is a member of the IEEE Society of Systems, Man, and Cybernetics, the Association for Automated Reasoning, the Institute of Information and Computing Machinery, the Taiwan Fuzzy Systems Association, and the Taiwanese Association of Artificial Intelligence.
