
International Journal of Computer Trends and Technology (IJCTT), Volume 11, Number 2, May 2014
ISSN: 2231-2803    http://www.ijcttjournal.org


Feature Subset Selection Techniques - Swift Clustering and Principal Component Analysis
Appurai N. Pai (1), Smt. T. Jayakumari (2)
(1) M.Tech, Dept. of Computer Science and Engineering, BTL Institute of Technology, Bangalore, India
(2) Assistant Professor, Dept. of Computer Science and Engineering, BTL Institute of Technology, Bangalore, India
Abstract-- Feature selection (FS) is the process of selecting the most informative features. In some cases, too many redundant or irrelevant features can overwhelm the features that matter for classification. Feature selection remedies this problem and thereby improves prediction accuracy and reduces the computational overhead of classification algorithms. Irrelevant features do not contribute to predictive accuracy, and redundant features do not contribute to building a better predictor because they provide mostly information that is already present in other features. Many feature subset selection methods have been proposed and studied for machine learning applications. We propose SWIFT, a fast clustering-based feature selection algorithm built on a minimum spanning tree (MST). Because features in different clusters are relatively independent, this clustering-based strategy gives SWIFT a high probability of producing a subset of useful and independent features. SWIFT is compared with Principal Component Analysis (PCA) in this work and overcomes several of its drawbacks.
Keywords: MST, SWIFT, Symmetric Uncertainty, T-Relevance, F-Correlation, PCA.
MST - Minimum Spanning Tree, Swift - Fast, PCA - Principal Component Analysis
I. INTRODUCTION
The purpose of data mining is to extract information from a large data set and transform it into an understandable form for further use. Clustering is a significant task in data analysis and data mining applications: it is the task of grouping a set of objects so that objects in the same group (cluster) are more similar to each other than to those in other groups. Data mining proceeds through several phases. A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity. Feature selection involves identifying a subset of the most useful features that produces results comparable to those of the original, entire set of features. Principal Component Analysis (PCA) is the predominant linear dimensionality reduction technique, and it has been widely applied to datasets in all scientific domains. Informally, PCA seeks to map or embed data points from a high-dimensional space into a low-dimensional space while keeping the relevant linear structure intact. To improve the efficiency and accuracy of data mining tasks on high-dimensional data, the data must be preprocessed by an
efficient dimensionality reduction method. Principal
Component Analysis (PCA) is a popular linear feature
extractor used for unsupervised feature selection; it relies on eigenvector analysis to identify the original features that matter most for the principal components. PCA is a statistical technique for determining the key variables in a high-dimensional data set that explain the differences in the observations, and it can be used to simplify the analysis and visualization of such a data set without much loss of information. A feature
selection algorithm may be evaluated from both the
efficiency and effectiveness points of view. While the
efficiency concerns the time required to find a subset of
features, the effectiveness is related to the quality of the
subset of features. Based on these criteria, a swift (fast) clustering-based feature selection algorithm, SWIFT, is proposed and experimentally evaluated in this work. The SWIFT algorithm
consists of two steps. In the first step, features are divided
into clusters by using graph-theoretic clustering methods. In
the second step, the most representative feature that is
strongly related to target classes is selected from each cluster
to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of SWIFT has a high probability of producing a subset of useful and independent features. To ensure the efficiency of
SWIFT, we adopt the efficient minimum-spanning tree
clustering method. The efficiency and effectiveness of the
SWIFT algorithm are evaluated through an empirical study.
Feature selection is a term commonly used in data mining to
describe the tools and techniques available for reducing
inputs to a manageable size for processing and analysis.
Feature selection implies not only cardinality reduction,
which means imposing an arbitrary or predefined cutoff on
the number of attributes that can be considered when
building a model, but also the choice of attributes, meaning
that either the analyst or the modeling tool actively selects or
discards attributes based on their usefulness for analysis. The
ability to apply feature selection is critical for effective
analysis, because datasets frequently contain far more
information than is needed to build the model. If unwanted
columns are kept while building the model, more CPU time
and memory are required during the training process. Even if
resources are not an issue, we typically want to remove
unneeded columns because they might degrade the quality of
discovered patterns. Feature selection can be applied to inputs, to predictable attributes, or to states in a column. When scoring for feature selection is complete, only the attributes and states selected by the algorithm are included in the model-building process; an attribute excluded by feature selection can still be used for prediction, but its predictions are based solely on the global statistics that exist in the model.
Many feature subset selection methods have been proposed
and studied for machine learning applications. They can be
divided into four broad categories: the Embedded, Wrapper,
Filter, and Hybrid approaches. The embedded methods
incorporate feature selection as a part of the training process
and are usually specific to given learning algorithms, and
therefore may be more efficient than the other three
categories. Wrapper methods use a predictive model to score
feature subsets. Each new subset is used to train a model,
which is tested on a hold-out set. Filter methods use a proxy
measure instead of the error rate to score a feature subset.
This measure is chosen to be fast to compute, whilst still
capturing the usefulness of the feature set. Filters are usually
less computationally intensive than wrappers, but they
produce a feature set which is not tuned to a specific type of
predictive model. Hybrid methods combine filter and wrapper approaches, using a filter method to reduce the search space that the subsequent wrapper considers. Thus, we will focus on the hybrid method in this paper.
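
As a rough illustration of how these categories differ in practice, the following sketch scores features with a fast filter measure and then lets a simple wrapper search refine the filtered candidates, which is the essence of a hybrid method. It is illustrative only: scikit-learn, the synthetic data set, and the greedy forward search are assumptions of this sketch, not part of the present work.

# Illustrative only: a filter step (mutual information) followed by a greedy
# wrapper step (cross-validated forward selection), i.e. a tiny hybrid method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           n_redundant=10, random_state=0)

# Filter: score every feature with a fast proxy measure.
scores = mutual_info_classif(X, y, random_state=0)
candidates = [int(i) for i in np.argsort(scores)[-10:]]   # keep the 10 best-scoring features

# Wrapper: greedily grow a subset, scoring each candidate subset with a model.
def cv_accuracy(cols):
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, cols], y, cv=5).mean()

subset = []
for _ in range(3):                                        # pick up to 3 features
    best = max(candidates, key=lambda c: cv_accuracy(subset + [c]))
    subset.append(best)
    candidates.remove(best)

print("hybrid-selected features:", subset, "cv accuracy:", round(cv_accuracy(subset), 3))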
II. RELATED WORK
A feature selection algorithm can be seen as the combination
of a search technique for proposing new feature subsets,
along with an evaluation measure which scores the different
feature subsets. The simplest algorithm is to test each
possible subset of features finding the one which minimizes
the error rate. This is an exhaustive search of the space, and
is computationally intractable for all but the smallest of
feature sets. Feature subset selection can be viewed as the
process of identifying and removing as many irrelevant and
redundant features as possible. This is because (i) irrelevant features do not contribute to predictive accuracy [20], and (ii) redundant features do not help to build a better predictor because they provide mostly information that is already present in other features. Of the many feature subset selection algorithms, some can effectively eliminate irrelevant features but fail to handle redundant features [16], while others can eliminate irrelevant features while also taking care of redundant ones. The proposed SWIFT
algorithm falls into the second group.
Traditionally, feature subset selection research has focused
on searching for relevant features. A well-known example is Relief, which weighs each feature according to its ability to discriminate instances under different targets based on a distance-based criterion function. However, Relief is ineffective at removing redundant features, as two predictive but highly correlated features are both likely to be highly weighted. Relief-F extends Relief, enabling the method to work with noisy and incomplete data sets and to deal with multi-class problems, but it still cannot identify redundant features. Yet, along with irrelevant features, redundant features also affect the speed and accuracy of learning algorithms and thus should be eliminated as well. CFS, CMIM [15], and FCBF [22] are examples that take redundant features into consideration. CFS is based on the hypothesis that a good feature subset is one that contains features highly correlated with the target, yet uncorrelated with each other. FCBF [22] is a fast filter method that can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis. CMIM [15] iteratively picks features that maximize their mutual information with the class to predict, conditioned on the response of any feature already picked. Unlike these algorithms, the proposed SWIFT algorithm employs a clustering-based method to choose features.
Recently, hierarchical clustering has been adopted for word selection in the context of text classification (e.g., [4] and [13]). Distributional clustering has been used to cluster words into groups based either on their participation in particular grammatical relations with other words, by Pereira et al., or on the distribution of class labels associated with each word, by Baker and McCallum [4]. Because distributional clustering of words is agglomerative in nature, resulting in sub-optimal word clusters and high computational cost, Dhillon et al. [13] proposed a new information-theoretic divisive algorithm for word clustering and applied it to text classification. Butterworth et al. [6] proposed to cluster features using the special metric of the Barthelemy-Montjardet distance, and then to use the dendrogram of the resulting cluster hierarchy to choose the most relevant attributes.
Unfortunately, the cluster evaluation measure based on
Barthelemy-Montjardet distance does not identify a feature
subset that allows the classifiers to improve their original
performance accuracy. Furthermore, even compared with
other feature selection methods, the obtained accuracy is
lower. Hierarchical clustering has also been used to select features on spectral data. Van Dijk and Van Hulle proposed a hybrid filter/wrapper feature subset selection algorithm for regression. Krier et al. presented a methodology combining hierarchical constrained clustering of spectral variables and selection of clusters by mutual information.
Their feature clustering method is similar to that of Van Dijk and Van Hulle, except that the former forces every cluster to contain only consecutive features. Both methods employed agglomerative hierarchical clustering to remove redundant features.
Quite different from these hierarchical clustering-based algorithms, our proposed SWIFT algorithm uses a minimum spanning tree-based method to cluster features. It does not assume that data points are grouped around centers or separated by a regular geometric curve, and it is not limited to specific types of data. With the aim of choosing a subset of good features with respect to the target concepts, feature subset selection is an effective way of reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility.
III. FEATURE SELECTION ALGORITHMS AND FLOW
DIAGRAM
Irrelevant features, along with redundant features, severely affect the accuracy of learning machines. Thus,
feature subset selection should be able to identify and remove
as much of the irrelevant and redundant information as
possible. Moreover, good feature subsets contain features
highly correlated with (predictive of) the class, yet
uncorrelated with (not predictive of) each other. Keeping
these in mind, we developed a novel algorithm which can
efficiently and effectively deal with both irrelevant and
redundant features, and obtain a good feature subset.

Algorithm: PCA
Input: data matrix
Output: reduced set of features
Step 1: X <- create the N x d data matrix, with one row vector x_n per data point.
Step 2: subtract the mean x̄ from each row vector x_n in X.
Step 3: Σ <- covariance matrix of X.
Step 4: find the eigenvectors and eigenvalues of Σ.
Step 5: PCs <- the M eigenvectors with the largest eigenvalues.
Step 6: output PCs.
Where the mean of the n data points is
    x̄ = (1/n) * sum_{i=1..n} x_i,
the covariance of two attributes x and y of the dataset is
    Cov(x, y) = sum_{i=1..n} (x_i - x̄)(y_i - ȳ) / (n - 1),
and the covariance matrix is
    Σ = [ Cov(x, x)  Cov(x, y)
          Cov(y, x)  Cov(y, y) ],
for x and y as attributes of the dataset.
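
To make these steps concrete, the following is a minimal NumPy sketch of the PCA procedure above. It is an illustration under the stated steps, not the exact implementation used in this work; the data matrix and the number of retained components M are arbitrary here.

# A minimal PCA sketch in NumPy, following the steps above.
import numpy as np

def pca(X, M):
    """Project the N x d matrix X onto its M leading principal components."""
    X_centered = X - X.mean(axis=0)            # Step 2: subtract the mean
    cov = np.cov(X_centered, rowvar=False)     # Step 3: d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # Step 4: eigen-decomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1][:M]      # Step 5: indices of the M largest eigenvalues
    pcs = eigvecs[:, order]                    # principal component directions
    return X_centered @ pcs, pcs               # projected data and the PCs

# Example: reduce 10-dimensional random data to 3 components.
X = np.random.default_rng(0).normal(size=(100, 10))
Z, pcs = pca(X, M=3)
print(Z.shape)   # (100, 3)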
ALGORITHM: SWIFT
inputs: D(F1, F2, ..., Fm, C) - the given data set
        θ - the T-Relevance threshold.
output: S - the selected feature subset.
//==== Part 1: Irrelevant Feature Removal ====
1  for i = 1 to m do
2      T-Relevance = SU(Fi, C)
3      if T-Relevance > θ then
4          S = S ∪ {Fi};
//==== Part 2: Minimum Spanning Tree Construction ====
5  G = NULL; // G is a complete graph
6  for each pair of features {Fi, Fj} ⊆ S do
7      F-Correlation = SU(Fi, Fj)
8      add Fi and/or Fj to G with F-Correlation as the weight of the corresponding edge;
9  minSpanTree = Prim(G); // using Prim's algorithm to generate the minimum spanning tree
//==== Part 3: Tree Partition and Representative Feature Selection ====
10 Forest = minSpanTree
11 for each edge Eij ∈ Forest do
12     if SU(Fi, Fj) < SU(Fi, C) ∧ SU(Fi, Fj) < SU(Fj, C) then
13         Forest = Forest − {Eij}
14 S = ∅
15 for each tree Ti ∈ Forest do
16     FjR = argmax_{Fk ∈ Ti} SU(Fk, C)
17     S = S ∪ {FjR};
18 return S

STAGE 1: IRRELEVANT FEATURE REMOVAL
Symmetric uncertainty (SU) is a normalized measure of the dependence between two variables. It is defined as

    SU(X, Y) = 2 * [ Gain(X|Y) / (E(X) + E(Y)) ],

where E(X) is the entropy of a discrete random variable X. If p(x) denotes the prior probability of each value x of X, E(X) is defined by

    E(X) = - sum_{x ∈ X} p(x) log2 p(x).

Gain(X|Y) is the amount by which the entropy of X decreases after observing Y. It reflects the additional information about X provided by Y and is called the information gain, which is given by

    Gain(X|Y) = E(X) - E(X|Y) = E(Y) - E(Y|X),

where E(X|Y) is the conditional entropy, which quantifies the remaining entropy (i.e. uncertainty) of a random variable X given that the value of another random variable Y is known. If p(x) denotes the prior probability of each value x of X and p(x|y) the posterior probability of x given a value y of Y, then E(X|Y) is defined by
    E(X|Y) = - sum_{y ∈ Y} p(y) sum_{x ∈ X} p(x|y) log2 p(x|y).

Information gain is a symmetrical measure: the amount of information gained about X after observing Y is equal to the amount of information gained about Y after observing X. This ensures that the order of the two variables (e.g., (X, Y) or (Y, X)) does not affect the value of the measure. Symmetric uncertainty treats a pair of variables symmetrically, compensates for information gain's bias toward variables with more values, and normalizes its value to the range [0, 1]. A value of 1 for SU(X, Y) indicates that knowledge of the value of either variable completely predicts the value of the other, while a value of 0 indicates that X and Y are independent. Although entropy-based measures handle nominal or discrete variables, they can deal with continuous features as well, provided the values are discretized properly in advance [14]. Given SU(X, Y), the symmetric uncertainty of variables X and Y, the relevance (T-Relevance) between a feature and the target concept C is defined as follows.
Definition 1: (T-Relevance) The relevance between a feature Fi ∈ F and the target concept C is referred to as the T-Relevance of Fi and C, and is denoted by SU(Fi, C). If SU(Fi, C) is greater than a predetermined threshold θ, we say that Fi is a strong T-Relevance feature.
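
The quantities above can be sketched directly from their definitions. The following illustrative Python helpers (an assumption of this presentation, not the implementation used in this work) compute entropy, conditional entropy, and symmetric uncertainty for discrete or pre-discretized features; the T-Relevance of a feature is then simply SU(feature, class).

# Sketch of the entropy-based measures above; assumes discrete/discretized
# features supplied as 1-D integer arrays.
import numpy as np

def entropy(x):
    """E(X) = -sum p(x) log2 p(x) over the observed values of X."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    """E(X|Y) = -sum_y p(y) sum_x p(x|y) log2 p(x|y)."""
    total = 0.0
    for y_val, cnt in zip(*np.unique(y, return_counts=True)):
        total += (cnt / len(y)) * entropy(x[y == y_val])
    return total

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * Gain(X|Y) / (E(X) + E(Y))."""
    gain = entropy(x) - conditional_entropy(x, y)
    denom = entropy(x) + entropy(y)
    return 2.0 * gain / denom if denom > 0 else 0.0

# T-Relevance of a feature f with respect to the class c:
f = np.array([0, 0, 1, 1, 2, 2])
c = np.array([0, 0, 1, 1, 1, 1])
print(symmetric_uncertainty(f, c))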
STAGE 2: MINIMUM SPANNING TREE CONSTRUCTION
Given SU(X, Y), the symmetric uncertainty of variables X and Y, the correlation (F-Correlation) between a pair of features is defined as follows.
Definition 2: (F-Correlation) The correlation between any pair of features Fi and Fj (Fi, Fj ∈ F, i ≠ j) is called the F-Correlation of Fi and Fj, and is denoted by SU(Fi, Fj).

PRIM'S ALGORITHM:
A minimum spanning tree of an undirected, connected, weighted graph is a spanning tree of minimum total weight among all spanning trees. Prim's algorithm grows an MST as follows:
Start by picking any vertex to be the root of the tree.
While the tree does not contain all vertices in the graph, find the shortest edge leaving the tree and add it to the tree.
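
A minimal sketch of Part 2 of the SWIFT pseudocode is given below. It builds the complete F-Correlation graph over the features that survived Stage 1 and extracts a minimum spanning tree with Prim's algorithm. The use of networkx and of the symmetric_uncertainty() helper sketched above are assumptions of this example, not the implementation used in this work.

# Sketch of Part 2: complete SU-weighted graph, then Prim's MST.
import itertools
import networkx as nx

def f_correlation_mst(X, selected):
    """X: (n_samples x n_features) array of discretized features;
    selected: list of column indices that passed the T-Relevance filter."""
    G = nx.Graph()
    G.add_nodes_from(selected)
    for i, j in itertools.combinations(selected, 2):
        w = symmetric_uncertainty(X[:, i], X[:, j])   # F-Correlation = SU(Fi, Fj)
        G.add_edge(i, j, weight=w)                    # edge weight of the complete graph
    return nx.minimum_spanning_tree(G, algorithm="prim")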
STAGE 3: SELECTING REPRESENTATIVE FEATURES
The feature redundancy (F-Redundancy) and the representative feature (R-Feature) of a feature cluster are defined as follows.
Definition 3: (F-Redundancy) Let S = {F1, F2, ..., Fi, ..., Fk} (k < |F|) be a cluster of features. If there exists Fj ∈ S such that SU(Fj, C) ≥ SU(Fi, C) ∧ SU(Fi, Fj) > SU(Fi, C) always holds for each Fi ∈ S (i ≠ j), then the features Fi are redundant with respect to the given Fj (i.e. each Fi is an F-Redundancy).
Definition 4: (R-Feature) A feature Fi ∈ S = {F1, F2, ..., Fk} (k < |F|) is a representative feature of the cluster S (i.e. Fi is an R-Feature) if and only if Fi = argmax_{Fj ∈ S} SU(Fj, C). This means the feature with the strongest T-Relevance acts as the R-Feature for all the features in the cluster.
According to the above definitions, feature subset selection can be viewed as the process that identifies and retains the strong T-Relevance features and selects R-Features from the feature clusters.
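
For illustration, Part 3 of the pseudocode can be sketched as follows: delete every MST edge whose F-Correlation is smaller than both endpoints' T-Relevance, then pick the feature with the highest SU(F, C) from each remaining tree as the R-Feature. This sketch builds on the symmetric_uncertainty() and f_correlation_mst() examples above and is an assumption of this presentation, not the implementation used in this work.

# Sketch of Part 3: tree partition and representative feature selection.
import networkx as nx

def select_representatives(X, y, mst):
    forest = mst.copy()
    for i, j in list(forest.edges()):
        su_ij = symmetric_uncertainty(X[:, i], X[:, j])
        if su_ij < symmetric_uncertainty(X[:, i], y) and \
           su_ij < symmetric_uncertainty(X[:, j], y):
            forest.remove_edge(i, j)       # split weakly correlated features apart
    selected = []
    for tree in nx.connected_components(forest):
        # R-Feature: the member of the tree with the strongest T-Relevance SU(F, C)
        selected.append(max(tree, key=lambda k: symmetric_uncertainty(X[:, k], y)))
    return sorted(selected)

A full pipeline would first keep the features with SU(F, C) > θ (Part 1), build the MST on those features (Part 2), and then call select_representatives (Part 3).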

FLOW DIAGRAM
[Figure: flow diagram of the SWIFT process - irrelevant feature removal, minimum spanning tree construction, tree partition and representative feature selection.]
IV. DATASET DESCRIPTION
We have used two datasets from the UCI repository [1]. The first is the LUNG CANCER dataset, with 32 instances and 57 attributes (1 class attribute, 56 predictive). All predictive attributes are nominal, taking integer values 0-3, and the class distribution is 9 observations for class 1, 13 for class 2, and 10 for class 3. The second is the LIBRAS Movement dataset, with 360 instances (24 in each of fifteen classes), 90 numeric (double) attributes, and 1 integer class attribute; the class distribution is 6.66% for each of the 15 classes.
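
The sketch below illustrates one way the two datasets could be loaded; the local file names and column layout are assumptions of this example (it presumes the UCI data files have been downloaded beforehand), not part of the reported experiments.

# Illustrative loading sketch with pandas.
import pandas as pd

# Lung Cancer: class label in the first column, 56 nominal predictors after it,
# with "?" marking missing values in the UCI file (assumed local file name).
lung = pd.read_csv("lung-cancer.data", header=None, na_values="?")
y_lung, X_lung = lung.iloc[:, 0].to_numpy(), lung.iloc[:, 1:].to_numpy()

# Libras Movement: 90 numeric coordinates followed by the class label in the
# last column (assumed local file name). These continuous features would need
# to be discretized before applying the entropy-based measures [14].
libras = pd.read_csv("movement_libras.data", header=None)
X_lib, y_lib = libras.iloc[:, :-1].to_numpy(), libras.iloc[:, -1].to_numpy()

print(X_lung.shape, X_lib.shape)   # expected: (32, 56) and (360, 90)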

V. RESULTS AND DISCUSSION
The proposed SWIFT clustering algorithm and PCA were implemented in the NetBeans IDE. To evaluate them, the two datasets described above were used, and the outputs are tabulated in Table 1.
Table 1: Selected attributes

DATASET          LUNG CANCER     LIBRAS MOVEMENT
ATTRIBUTES       56              90
INSTANCES        32              360
SWIFT OUTPUT     7 attributes    2 attributes
PCA OUTPUT       21 attributes   9 attributes

VI. CONCLUSION
Feature selection is an efficient way to improve classifier accuracy, reduce dimensionality, and remove both irrelevant and redundant data. As shown in Table 1, the SWIFT algorithm selects fewer, more relevant features than PCA, which benefits classifier accuracy. For future work, we plan to explore different types of correlation measures and to study some formal properties of the feature space.

REFERENCES
[1] UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/
[2] A fast clustering-based feature subset selection algorithm for high-dimensional data, IEEE Transactions on Knowledge and Data Engineering, 25(1), 2013.
[3] Arauzo-Azofra A., Benitez J.M. and Castro J.L., A feature set measure based on Relief, In Proceedings of the Fifth International Conference on Recent Advances in Soft Computing, pp 104-109, 2004.
[4] Baker L.D. and McCallum A.K., Distributional clustering of words for text classification, In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 96-103, 1998.
[5] Bell D.A. and Wang H., A formalism for relevance and its application in feature subset selection, Machine Learning, 41(2), pp 175-195, 2000.
[6] Butterworth R., Piatetsky-Shapiro G. and Simovici D.A., On feature selection through clustering, In Proceedings of the Fifth IEEE International Conference on Data Mining, pp 581-584, 2005.
[7] Chikhi S. and Benhammada S., ReliefMSS: a variation on a feature ranking ReliefF algorithm, Int. J. Bus. Intell. Data Min., 4(3/4), pp 375-390, 2009.
[8] Dash M. and Liu H., Feature selection for classification, Intelligent Data Analysis, 1(3), pp 131-156, 1997.
[9] Dash M., Liu H. and Motoda H., Consistency based feature selection, In Proceedings of the Fourth Pacific Asia Conference on Knowledge Discovery and Data Mining, pp 98-109, 2000.
[10] Das S., Filters, wrappers and a boosting-based hybrid for feature selection, In Proceedings of the Eighteenth International Conference on Machine Learning, pp 74-81, 2001.
[11] Dash M. and Liu H., Consistency-based search in feature selection, Artificial Intelligence, 151(1-2), pp 155-176, 2003.
[12] Demsar J., Statistical comparison of classifiers over multiple data sets, J. Mach. Learn. Res., 7, pp 1-30, 2006.
[13] Dhillon I.S., Mallela S. and Kumar R., A divisive information theoretic feature clustering algorithm for text classification, J. Mach. Learn. Res., 3, pp 1265-1287, 2003.
[14] Fayyad U. and Irani K., Multi-interval discretization of continuous-valued attributes for classification learning, In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp 1022-1027, 1993.
[15] Fleuret F., Fast binary feature selection with conditional mutual information, Journal of Machine Learning Research, 5, pp 1531-1555, 2004.
[16] Forman G., An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning Research, 3, pp 1289-1305, 2003.
[17] Garcia S. and Herrera F., An extension on Statistical Comparisons of Classifiers over Multiple Data Sets for all pairwise comparisons, J. Mach. Learn. Res., 9, pp 2677-2694, 2008.
[18] Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R. and Caligiuri M.A., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286(5439), pp 531-537, 1999.
[19] Guyon I. and Elisseeff A., An introduction to variable and feature selection, Journal of Machine Learning Research, 3, pp 1157-1182, 2003.
[20] John G.H., Kohavi R. and Pfleger K., Irrelevant features and the subset selection problem, In Proceedings of the Eleventh International Conference on Machine Learning, pp 121-129.
[21] Kohavi R. and John G.H., Wrappers for feature subset selection, Artif. Intell., 97(1-2), pp 273-324, 1997.
[22] Yu L. and Liu H., Feature selection for high-dimensional data: a fast correlation-based filter solution, In Proceedings of the 20th International Conference on Machine Learning, 20(2), pp 856-863, 2003.
