
An Idea of Setting Weighting Functions for Feature Selection

Li Weijie, Xu Yong, Yao Lu


Bio-Computing Research Center, Harbin Institute of Technology Shenzhen Graduate School

Abstract—In this paper, we propose a novel feature selection method that effectively improves traditional mutual information based feature selection. The method first performs traditional mutual information based feature selection. It then multiplies each selected feature by a weighting coefficient that is directly related to the mutual information value between the feature and the class labels. Finally, the weighted features with large mutual information values are used as the final features for classification. The results of nearest neighbor (NN) classification on spam email filtering and prediction of molecular bioactivity show that the proposed method is able to improve the performance of NN classification. In addition, using fewer features, NN classification is capable of achieving the same accuracy as NN classification using all of the original features.
Index Items—Pattern recognition; Feature selection; Nearest neighbor classification; Mutual information; Weighting coefficient

1. Introduction

Mutual information based feature selection has received much attention in the field of feature
selection. The mutual information of two random variables is a quantity that measures the mutual
dependence of the two variables. In other words, the larger the mutual information value between
two random variables is, the more information one random variable tells about another one. For a
pattern classification problem, mutual information based feature selection allows features having
high correlation with class labels to be selected. As a result, the obtained features may lead to
higher classification accuracy than the original features.

Nearest neighbor classification often follows feature selection to perform classification. Note
that most methods [1, 2, 3, 4] treat all features obtained through feature selection equally even
though they are associated with different mutual information values. On the other hand, according
to the definition of mutual information between features and class labels, it is easy to see that a
feature with a large mutual information value generally has high correlation with the class label
of the corresponding sample. As a result, it appears that a feature with a larger mutual information
value is more significant for the classification decision than others. It is therefore reasonable that
the distance between two samples should be evaluated primarily by the distances between the
features with large mutual information values, and that samples should be classified according to
this distance. On the contrary, treating all features equally regardless of their mutual information
values appears to be a drawback of previously developed methods. We can take a simple digital
logic circuit as an example,

Y = A(B + \bar{B}),    (1)

where the inputs are A and B, and the output is Y. The following figures show the circuit structure and truth table of Eq. 1.

Fig. 1. Circuit structure of Eq.1.

Fig. 2. Truth table of Eq.1. Fig. 3. Pattern classification of Eq. 1.


Eq. 1 can be taken as a two-category pattern classification problem where the outputs "0" and
"1" respectively denote the first and second class, as shown in Fig. 3. Suppose that the class labels
of X1, X2 and X3 are known and the class label of X = (1, 1) is unknown. It is difficult to classify
X with the nearest neighbor rule using the Euclidean distance. Indeed, the distance between X and
X2 is the same as the distance between X and X3, and these two distances are both shorter than
the distance between X and X1. However, we can reduce the uncertainty of the classification of X.
From Eq. 1, we know that the value of A has a higher correlation with the class label than B. Thus
we designate different weights for A and B. If the weights of A and B are set to 1 and 0
respectively, then the samples X, X1, X2 and X3 will be transformed into X → (1, 0), X1 → (0, 0),
X2 → (0, 0) and X3 → (1, 0) by multiplying each feature by its weight. As a result, the distances
from X to X1, X2 and X3 will be 1, 1 and 0, respectively, and X will be correctly classified into the
second class according to these distances.
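To make the example concrete, the following minimal Python sketch reproduces it. The coordinates X1 = (0, 0), X2 = (0, 1) and X3 = (1, 0) are inferred from the truth table in Fig. 2 (they are the three labelled input combinations other than X = (1, 1)); the function and variable names are ours.

import numpy as np

# Labelled samples from the truth table of Eq. 1 (class = output Y = A).
samples = {"X1": np.array([0, 0]), "X2": np.array([0, 1]), "X3": np.array([1, 0])}
labels = {"X1": 0, "X2": 0, "X3": 1}
X = np.array([1, 1])                       # query sample with unknown class

def nn_label(query, weights):
    # Multiply each feature by its weight, then apply the nearest neighbor rule.
    dists = {name: np.linalg.norm((query - s) * weights) for name, s in samples.items()}
    nearest = min(dists, key=dists.get)
    return labels[nearest], dists

print(nn_label(X, np.array([1, 1])))       # equal weights: X2 and X3 tie at distance 1
print(nn_label(X, np.array([1, 0])))       # weights (1, 0): X3 is at distance 0, class "1"

With equal weights the rule cannot resolve the tie between X2 and X3, while the weights (1, 0) settle it exactly as described above.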
NN classification, which has been widely used in the domains of pattern classification and
machine learning since Cover and Hart published their work in the 1960s [5, 6], assigns a sample
with an unknown class label to the same class as its nearest neighbor in a set of samples with
known class labels. This decision rule needs no explicit knowledge of the underlying distributions
of the data and is therefore known as a nonparametric technique [7]. A well-known characteristic
of the nearest neighbor rule is that, for all distributions, its probability of error is bounded above
by twice the Bayes probability of error [5]. A disadvantage of NN classification is that its
implementation requires storing all samples of the training set and comparing each sample point
to be classified with each training sample. In order to reduce both space and time complexity,
several techniques such as instance-based, lazy, memory-based and case-based learners have been
proposed [8]. After the size of the training set is reduced, NN classification has a lower
computational complexity. These methods can be grouped into three classes depending on the
objectives that they want to achieve [9].
In this paper, we first use mutual information to perform feature selection and then multiply
each selected feature by a weighting coefficient to obtain a final feature that is employed for
classifying samples. The weighting coefficient is directly related to the mutual information value
between the selected feature and the class labels. We then apply the nearest neighbor rule, which
is able to classify samples without any prior knowledge, to perform classification. This paper is
organized as follows. First, feature selection based on mutual information is introduced in Section
2. Then we present our method of improving mutual information based feature selection in detail
in Section 3. After that, we assess the performance of our method in Section 4 through
experiments on two data sets. We offer the conclusion in the final section.

2. Feature Selection and Mutual Information

Feature selection is of importance in the field of pattern recognition [10]. For pattern
classification problems that inherently possess high-dimensional features, feature selection is a
good way to reduce the computational cost. Especially in domains such as biological data
analysis that suffer from the curse of dimensionality, feature selection appears to be quite significant.
In addition, gains in accuracy can also be expected after feature selection because feature selection
can eliminate irrelevant features that may act as noise. A number of approaches such as sequential
forward selection, sequential backward selection and sequential forward floating selection have
been proposed [11].
Among feature selection approaches, mutual information based feature selection has received
much attention [12, 13, 14]. The general procedure of the widely used mutual information based
feature selection is as follows. First, the mutual information values between features and class
labels are calculated. Then the high-valued features are selected and the low-valued features are
simply discarded. Indeed, before the concept of mutual information was introduced in the field of
pattern recognition, it had been used in the field of communication to measure the reduction of
the uncertainty of one variable given the knowledge of other variables. If the uncertainty of a
variable z given another variable y becomes smaller, we can say that knowing y is helpful in
determining the value of z. Additionally, the larger the mutual information value between y and
z is, the more useful y is in determining z. Similarly, for pattern classification problems, the
higher the mutual information value between a feature and the class labels is, the better the feature
is. In fact, this is the rationale of mutual information based feature selection for classification.
Suppose that there are n training samples in total. The training set can be formulated as
E = \{(X_j, c_j) \mid j = 1, 2, 3, \ldots, n\}, where X_j is the j-th example consisting of d feature
components and is defined as X_j = \{X_{j1}, X_{j2}, \ldots, X_{jd}\}. Given that there are p classes,
c_j satisfies c_j \in C = \{c_1, c_2, \ldots, c_p\}, j = 1, 2, \ldots, n, where c_1, c_2, \ldots, c_p are the
class labels of the p classes, respectively. Hereafter we denote the i-th feature of a feature vector
by x_i. The mutual information of x_i and C can be expressed as

I(x_i; C) = H(C) - H(C \mid x_i),    (2)

where I(x_i; C) is the mutual information between x_i and C, H(C) is the entropy of the set of class labels C, and H(C \mid x_i) is the conditional entropy of C given x_i. They can be respectively calculated using

H(C) = -\sum_{j=1}^{p} p(c_j) \log p(c_j),    (3)

H(C \mid x_i) = -\sum_{k=1}^{n} \sum_{j=1}^{p} p(c_j \mid X_{ki}) \log p(c_j \mid X_{ki}),    (4)

where X_{ki} denotes the i-th component of the k-th training sample, and p(\cdot) denotes the probability.

From (2), (3) and (4), we can get

I(x_i; C) = \sum_{k=1}^{n} \sum_{j=1}^{p} p(c_j, X_{ki}) \log \frac{p(c_j, X_{ki})}{p(c_j)\, p(X_{ki})}.    (5)

Now we can compute each feature's mutual information using Eq. 5. We then sort the candidate
features by their mutual information values, and the top features whose mutual information is
greater than some threshold are selected.
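As a concrete illustration of this procedure, the sketch below estimates I(x_i; C) for discrete features from empirical frequencies and keeps the highest-ranked features; the function names and the plug-in probability estimates are our own choices rather than anything prescribed in this paper.

import numpy as np

def mutual_information(feature, labels):
    # Plug-in estimate of I(x; C) for one discrete feature column (Eq. 5).
    mi = 0.0
    for v in np.unique(feature):
        p_v = np.mean(feature == v)
        for c in np.unique(labels):
            p_c = np.mean(labels == c)
            p_vc = np.mean((feature == v) & (labels == c))
            if p_vc > 0:
                mi += p_vc * np.log(p_vc / (p_v * p_c))
    return mi

def select_features(X, y, num_selected):
    # Rank all feature columns by mutual information with the class labels.
    mi_values = np.array([mutual_information(X[:, i], y) for i in range(X.shape[1])])
    order = np.argsort(mi_values)[::-1]            # descending mutual information
    return order[:num_selected], mi_values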

3. Weighted features

3.1 Devising weighting functions

In the previous literature [1, 2, 3, 4], features selected using mutual information are usually
used directly to classify samples. However, as mentioned above, different features appear to be of
different significance for the classification decision. To emphasize the greater significance of
features with larger mutual information values compared with other features, in this section we
aim at devising schemes for assigning a weighting coefficient to each feature.
We adopt the following principle to set the weighting coefficient: the higher the mutual
information, the greater the weight of the feature. The weighting coefficient can thus be written as
a function of the mutual information:
w_i = \varphi(I(x_i; C)),    (6)

where \varphi, namely the weighting function, is a function with the following properties:

Nonnegativity: for any x, \varphi(x) \geq 0;

Monotonicity (nondecreasing): for any x_1 > x_2, \varphi(x_1) \geq \varphi(x_2).

In practice, \varphi can be selected as one of the monotone nondecreasing functions such as constant
functions, linear functions, power functions, exponential functions, and logarithm functions,
which are shown in Fig. 4. Notice that different types of functions provide the features with
different kinds of weighting coefficients. For example, a linear function assigns to a feature a
weighting coefficient proportional to its mutual information value, while a constant function
assigns identical weighting coefficients to all features. It seems that by using an exponential
function as the weighting function we can place the most emphasis on the significance of the
features with large mutual information values. In addition, the other two functions shown in
panels 'c' and 'e' of Fig. 4 produce moderate emphasis on the features with middle-scale mutual
information values and general emphasis on the other features. Note that since a constant
function assigns an identical weighting coefficient to each feature, NN classification on final
features obtained using a constant weighting function is identical to NN classification on the
original features.
[Fig. 4 here: plots over [0, 1] of the five candidate weighting functions, namely y = 1, y = x, y = 100*sqrt(x), y = exp(30*x), and y = ln(1000*x).]
Fig. 4. Five different weighting functions. “a”, “b”, “c”, “d”, “e” respectively show a constant function, linear
function, power function, exponential function, and logarithm function

3.2 Producing final features and classification

We introduce the following weighting matrix to represent the weighting coefficients of all the
features:

W = \begin{pmatrix} w_1 & & 0 \\ & \ddots & \\ 0 & & w_d \end{pmatrix},    (7)

where d is the number of features. By multiplying the features of a sample by the weighting
matrix, we transform them into a new form, which we call the final features. For example, if the
original feature vector of a sample is X = (x_1, x_2, \ldots, x_d), it is transformed into the final
feature vector Y through the multiplication

Y = XW = (w_1 x_1, w_2 x_2, \ldots, w_d x_d).    (8)

Let E be a set of n training samples and let c_1, c_2, \ldots, c_p denote the p different classes.
Suppose that the i-th class has n_i training samples, where i = 1, 2, \ldots, p. Then the distance
between a new sample Y and the i-th class can be calculated using

g_i(Y) = \min_{k} \| Y - Y_i^k \|, \quad k = 1, 2, \ldots, n_i,    (9)

where Y_i^k is the k-th training sample belonging to the i-th class. The class label of Y is
determined by

j = \arg\min_{i} g_i(Y), \quad i = 1, 2, \ldots, p.    (10)
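A minimal sketch of Eqs. 6-10 in Python is given below, assuming the mutual information values of the selected features have already been computed (for instance with Eq. 5) and using the linear weighting function as the default; the function and variable names are ours. Since the per-class minimum of Eq. 9 followed by the arg min of Eq. 10 simply yields the overall nearest training sample, the sketch takes the global minimum distance directly.

import numpy as np

def weighted_nn_classify(X_train, y_train, X_test, mi_values, phi=lambda m: m):
    # Weighting coefficients of Eq. 6; multiplying by diag(w) in Eq. 7 is the
    # same as element-wise scaling of each feature (Eq. 8).
    w = phi(mi_values)
    train_final = X_train * w
    test_final = X_test * w
    predictions = []
    for y_vec in test_final:
        dists = np.linalg.norm(train_final - y_vec, axis=1)   # distances of Eq. 9
        predictions.append(y_train[np.argmin(dists)])         # decision of Eq. 10
    return np.array(predictions)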

4. Experimental results

We use two data sets, a spam email filter data set [20] and the data set associated with task 1
of KDD Cup 2001 [22], to test our method. The spam email filter aims at distinguishing
unsolicited bulk emails (spam emails) from legitimate emails, while task 1 of KDD Cup 2001
attempts to recognize the active compounds from the inactive ones. We evaluate the total
accuracy, the 1-accuracy, and the 0-accuracy of our method. Hereafter, let 1 denote the spam
emails or the active compounds, while 0 denotes the legitimate emails or the inactive compounds.
n_{10} is the number of samples that are classified into class 0 whereas their genuine class is 1,
and n_{01} is the number of samples that are classified into class 1 whereas their genuine class is
0. As a result, the total accuracy, 1-accuracy, and 0-accuracy can be respectively calculated as
follows:
n11  n00 n
AC  (11) 1AC  n  n
11

n11  n10  n00  n01 11 1 0


n
(12) 0AC  n
0 0
(13)
0 0  n0 1

In general, a high accuracy indicates low sum of 1  0 and 0  1 error rate, and high 1-
accuracy and 0-accuracy implies low 1  0 and 0  1 error rate.
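These three measures translate directly into code; the short sketch below (with names of our own choosing) computes them from the genuine and predicted labels.

def accuracies(y_true, y_pred):
    # Total accuracy, 1-accuracy and 0-accuracy of Eqs. 11-13 for labels in {0, 1}.
    n11 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    n10 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n00 = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n01 = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    ac = (n11 + n00) / (n11 + n10 + n00 + n01)       # Eq. 11
    ac1 = n11 / (n11 + n10) if n11 + n10 else 0.0    # Eq. 12
    ac0 = n00 / (n00 + n01) if n00 + n01 else 0.0    # Eq. 13
    return ac, ac1, ac0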

4.1 Spam Email Filter

Our first data set for experimentation, i.e. the email data set mentioned above, is obtained from
the China Anti-Spam Alliance (CASA) [21]. The training data set is a subset of the sample set
collected in June 2006, while the test set is a subset of the sample set collected in July 2006. Each
sample set consists of 5000 spam emails and 5000 legitimate emails. The test data set, composed
of 800 spam emails and 400 legitimate emails, was collected by a third party [20]. In this
experiment, we used three different training subsets to test our method. For each training subset,
the number of spam emails is the same as that of legitimate emails. The first, second and third
training subsets consist of 1000, 1500 and 2000 spam emails, respectively, together with 1000,
1500 and 2000 legitimate emails, respectively. For each case, we also test our method under the
conditions that the first 400, 600 and 800 features of each sample are used for classification. The
weighting function \varphi(I(x_i; C)) = I(x_i; C), which takes as the weighting coefficient the
mutual information value between each original feature and the class labels, is employed to
produce the weight matrix W. After transforming the original features into final features using W,
we performed the classification experiments. Table 1 shows the total accuracies of NN
classification on original features and final features. Tables 2 and 3 show the 0-accuracy and
1-accuracy of NN classification, respectively.
Table 1 Total accuracies of NN classification on original features and final features

              1000+1000               1500+1500               2000+2000
          Original   Final        Original   Final        Original   Final
          features   features     features   features     features   features
   400    87.2%      90.3%        91.5%      91.4%        90.8%      90.5%
   600    85.5%      91.4%        88.5%      91.0%        88.3%      90.3%
   800    83.9%      91.3%        87.2%      91.3%        87.0%      90.3%

Table 2 0-accuracies of NN classification on original features and final features

              1000+1000               1500+1500               2000+2000
          Original   Final        Original   Final        Original   Final
          features   features     features   features     features   features
   400    68.0%      90.0%        78.0%      90.3%        77.3%      92.5%
   600    63.5%      90.5%        73.8%      89.8%        72.3%      92.0%
   800    56.8%      90.0%        70.8%      96.5%        66.5%      91.5%

Table 3 1-accuracies of NN classification on original features and final features

              1000+1000               1500+1500               2000+2000
          Original   Final        Original   Final        Original   Final
          features   features     features   features     features   features
   400    89.1%      90.5%        98.3%      92.0%        98.9%      89.5%
   600    89.1%      91.9%        95.8%      91.6%        96.4%      89.5%
   800    89.2%      92.0%        95.4%      92.0%        97.3%      89.8%

In these tables, "1000+1000" means that the training sample set consists of 1000 spam emails
and 1000 legitimate emails, and "400" means that the number of features used for classification
is 400.
From these tables, we can see that NN classification on final features performs better than NN
classification on original features. It is also noticeable that when more features are used for
classification, NN classification on original features obtains a lower total accuracy and 0-accuracy.
This implies that for original features, when features with small mutual information values are
added to the set of features exploited for classification, they are not able to provide very useful
information for improving the classification decision and may even degrade the classification
performance. It seems that these features not only have little correlation with the class labels of
the samples but also may prevent the features with large mutual information values from exerting
their influence on classification. On the other hand, by designating larger weighting coefficients
to features with large mutual information values, our method proportionally emphasizes the
significance of these features in classification. Consequently, because our method properly
exploits the information of different features, compared with NN classification on original
features our method is able to achieve higher accuracy when using the same number of features.
In addition, it appears that the classification performance of our method does not decrease with
an increasing number of features. From another point of view, in comparison with NN
classification on original features, NN classification on final features may achieve the same
accuracy by exploiting fewer features. This allows the computational cost of NN classification to
be cut down to some extent.

4.2 Prediction of Molecular Bioactivity

Our second experiment tests our method on the data set associated with task 1 of KDD Cup 2001,
predicting the molecular bioactivity of drugs [22]. The training data set includes 1,909
compounds. Of these compounds, 24 are active and the others are all inactive. Each compound is
described by a feature vector comprising a class value (an active or inactive indicator) and 139,351
binary numbers. These binary numbers were generated in an internally consistent manner for all
1,909 compounds. The test set contains 636 additional compounds that were in fact generated
based on the assay results recorded for the training set. The test set has the same format as the
training set, with the exception that the activity value for each data point is missing.
As a two-class classification problem including active and inactive compounds, this problem has
the following characteristics. First, the dimension (139,351) of the feature vector is huge. Second,
there are far more inactive compounds than active compounds. Third, the size of the training set
is very small in comparison with the dimension of the samples, which means that the training set
cannot represent the true distribution of the sample space well.
In the experiment, the number of features varies from 100 to 2,000. We test our method using
the following weighting functions:

the constant function (CF), \varphi(I(x_i; C)) = 1;

the linear function (LF), \varphi(I(x_i; C)) = I(x_i; C);

the power function (PF), \varphi(I(x_i; C)) = 100\sqrt{I(x_i; C)};

the exponential function (EF), \varphi(I(x_i; C)) = \exp(30 \times I(x_i; C)); and

the logarithm function (GF), \varphi(I(x_i; C)) = \ln(1000 \times I(x_i; C)).
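For reference, these five weighting functions can be written directly as callables that act element-wise on the vector of mutual information values; this is merely a transcription of the formulas above, with the dictionary keys chosen by us.

import numpy as np

weighting_functions = {
    "CF": lambda mi: np.ones_like(mi, dtype=float),   # constant function
    "LF": lambda mi: mi,                              # linear function
    "PF": lambda mi: 100.0 * np.sqrt(mi),             # power function
    "EF": lambda mi: np.exp(30.0 * mi),               # exponential function
    "GF": lambda mi: np.log(1000.0 * mi),             # logarithm function
}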

Fig. 5. Results of the second experiment. (a) shows the total accuracies of NN classification with
different weighting functions; (b) and (c) show the 1-accuracies and 0-accuracies, respectively, of
NN classification with different weighting functions.
Fig. 5.a, 5.b and 5.c show respectively the total accuracy, 1-accuracy and 0-accuracy of NN
classification with different weighting functions. Note that NN classification with the constant
weighting function is the same as NN classification on original features. Fig. 5.a shows that it is
possible for NN classification on final features to obtain a higher total accuracy than NN
classification on original features. Fig. 5.b shows that the 1-accuracy of NN classification on final
features can be much higher than that of NN classification on original features. Fig. 5.c shows
that the performance of our method is better than or equal to that of the naive nearest neighbor
classification. Some possible reasons are as follows. There are 139,351 features in each
compound, which is an extremely large number. After obtaining each feature's mutual
information from Eq. 5, we find that many features have the same mutual information value,
which means that some further technique may be needed to discriminate among them. It is also
noticeable that the active compounds in the training set are very few whereas the dimensionality
of the features is quite high. Classifying samples by directly measuring their similarity on all
available features would be very time consuming and even impractical. The fact that our method
still achieves a better performance than the naive nearest neighbor classification on this special
data set with very high-dimensional features shows that our method is feasible and effective. In
addition, our method has another advantage: it makes nearest neighbor classification feasible for
this very high-dimensional data set.

5. Conclusion

The key idea of the method developed in this paper is to emphasize the different significance
in classification of the different original features, which are obtained by mutual information
based feature selection. Unlike existing methods, which treat all original features equally, our
method assigns large weighting coefficients to features associated with large mutual information
values and takes the product of each feature and its weighting coefficient as the final feature used
for classification.
The developed weighting function schemes are able to offer substantial improvements for
NN classification. The result of the first experiment, the spam email filter, shows that our rule is
effective, while the result of the second experiment, the prediction of molecular bioactivity,
which is more challenging, shows that the rule is still useful. If the number of candidate features
is huge, the result may not be better than that of the naive nearest neighbor rule. Another
advantage of NN classification on final features is that, in comparison with NN classification on
original features, it may achieve the same accuracy by using fewer features.

Acknowledgements

We wish to thank everyone who gave us advice on this paper.

References

[1] Domeniconi, C., Gunopulos, D., and Peng, J. Large Margin Nearest Neighbor Classifiers. IEEE Transactions on
Neural Networks, 2005: 16(4), pp. 899-909.
[2] Hastie, T., Tibshirani, R. Discriminant Adaptive Nearest Neighbor Classification. IEEE Transaction on PAMI,
1998: 18(6), pp.607-615.
[3] Peng, J., Heisterkamp, D. R., Dai, H.K. LDA/SVM Driven Nearest Neighbor Classifiers. IEEE Transaction on
Neural Networks, 2003: 14(4), pp.940-942.
[4] Domeniconi, C., Peng, J., Gunopulos, D. Locally Adaptive Metric Nearest Neighbor Classification, IEEE
Transaction on PAMI, 2002, pp.1281-1285.
[5] Cover, T.M. and Hart, P.E. Nearest neighbour pattern classification. IEEE Transactions on Information Theory,
1967:13, pp.21-27.
[6] Hart, P.E. The Condensed Nearest Neighbor Rule. IEEE Transactions on Information Theory, 1968:14, pp.515-
516.
[7] Richard O.D., Peter E.H. and David G.S. Pattern Classification. Beijing, China: China Machine Press. 2004:
174-187.
[8] Wilson D. and Martinez T. Reduction techniques for instance-based learning algorithms. Machine Learning,
2000: 38, pp.257–286.
[9] Brighton, H. and Mellish, C. Advances in instance selection for instance-based learning algorithms. Data
Mining and Know. Discovery, 2002:6, pp.153–172.
[10] Jiawei Han and Micheline Kamber. Data Mining Concepts and Techniques. China: China Machine Press.
2006: 72-86.
[11] Zongker, D. and Jain, A.K. (1996). Algorithms for feature selection: an evaluation. In Proceedings of
International Conference on Pattern Recognition, pp. 18-22, Vienna, IEEE Computer Society Press, Los
Alamitos, CA.
[12] YANG Sheng and GU Jun. Feature selection based on mutual information and redundancy-synergy
coefficient. Journal of Zhejiang University SCIENCE, 2004:5(11), pp.1382-1391.
[13] Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 1948:27, pp.
379–423, 623–656.
[14] Cover, T.M. (1991). Elements of Information Theory. Wiley, New York.
[15] Henley, W.E. and Hand, D.J. A k-nearest-neighbour classifier for assessing consumer credit risks. The
Statistician, 1996: 45(1): 77-95.
[16] Salzberg, S. (1990). Learning with Nested Generalized Exemplars. Norwell, MA: Kluwer Academic
Publishers.
[17] Trevor Hastie, Robert Tibshirani. Discriminant Adaptive Nearest Neighbor Classification. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 1996: 18(6), pp. 607-615.
[18] Scott C, Steven S. A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features. Machine
Learning, 1993: 10 (1): 57-78.
[19] Hand, D.J. (1997). Construction and Assessment of Classification Rules. Wiley, Chichester.
[20] ICRC-HITSZ. http://219.223.242.128/ai/
[21] China Anti-Spam Alliance. http://anti-spam.org.cn/
[22] KDD Cup 2001. www.cs.wisc.edu/~dpage/kddcup2001/

Li Weijie received the B.S. degree in 2006 from Guilin University of Electronic Technology.
He is now pursuing his M.S. degree at Harbin Institute of Technology Shenzhen Graduate School.
His current interests include pattern classification and image processing.

Xu Yong received the B.S. degree and the M.S. degree in 1994 and 1997, respectively. He
received his Ph.D. degree in pattern recognition and intelligent systems from Nanjing University
of Science and Technology, China, in 2005. He is now a professor at Harbin Institute of
Technology Shenzhen Graduate School. His current interests include face recognition,
handwritten character recognition, and linear and nonlinear feature extraction methods.

Yao Lu received the B.S. degree in 2006 from Heilongjiang Institute of Technology. She is
now pursuing her M.S. degree at Harbin Institute of Technology Shenzhen Graduate School. Her
current interests include pattern classification.
