A Novel Fitness Function in Genetic Algorithms to Optimize Neural Networks for Imbalanced Data Sets
Kuan-Chieh Huang and Yau-Hwang Kuo
Dept. of Computer Science and Information Engineering
National Cheng Kung University, Tainan, Taiwan
icyeh@chu.edu.tw
1. Introduction
Genetic algorithms have been popular optimization and search techniques since John Holland introduced adaptation in natural and artificial systems [1]. The fitness function in a genetic algorithm plays an important role in defining the search space. The major advantages of genetic algorithms are that they do not require gradient information and that they can search for globally near-optimal solutions. Imbalanced data sets are often encountered in business, industry and real-life applications, and a number of studies have investigated imbalanced classification problems [2-4]. While over-sampling and under-sampling are common solutions to imbalanced classification problems, over-fitting and the loss of critical information are their usual side effects. The back-propagation neural network (BPN) has been widely used in classification problems because of its classification capability and performance. An iterative gradient algorithm is applied to minimize the mean square error between the outputs of a neural network
978-0-7695-3382-7/08 $25.00 © 2008 IEEE. DOI 10.1109/ISDA.2008.252
and the desired outputs by optimizing the weights of the neural network. However, convergence to local optima and the difficulty of choosing a suitable network structure are major disadvantages of neural networks. Several studies have suggested using genetic algorithms to overcome these shortcomings [5-6]. However, how to solve imbalanced classification problems effectively and efficiently is still an important issue. The purpose of this paper is to propose a novel fitness function in genetic algorithms to optimize the structures and weights of neural networks for imbalanced data sets. The novel fitness function combines the following information: (a) the mean square error between the outputs of a neural network and the desired outputs; (b) the recognition error rate of each neuron; and (c) the distances between the examples and the classification boundary. Experimental results showed that the fitness function we proposed performed well on imbalanced classification problems.
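The optimization loop described above follows the usual generational pattern of a genetic algorithm. The sketch below is a minimal, illustrative implementation: the operator choices (binary tournament selection, one-point crossover, Gaussian per-gene mutation) and all parameter names are assumptions for illustration, not details taken from the paper.

```python
import random

def evolve(pop, fitness, n_gen=50, p_mut=0.1):
    """Evolve a population of chromosomes (lists of floats) toward
    higher fitness. A chromosome could encode network weights and
    link-activity flags; here it is just a generic float vector."""
    for _ in range(n_gen):
        scored = [(fitness(c), c) for c in pop]
        def pick():
            # binary tournament: keep the fitter of two random candidates
            a, b = random.sample(scored, 2)
            return a[1] if a[0] >= b[0] else b[1]
        nxt = []
        while len(nxt) < len(pop):
            p1, p2 = pick(), pick()
            cut = random.randrange(1, len(p1))        # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [g + random.gauss(0.0, 0.1) if random.random() < p_mut else g
                     for g in child]                  # per-gene mutation
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)
```

Because only fitness values are compared, no gradient of the objective is ever computed, which is the advantage noted above.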
In the conventional MLP, the number of hidden layers and the number of nodes in each hidden layer are usually determined by the operator. Since these choices are closely related to the performance of the MLP, not only these parameters but also the pruning of links between neurons are treated as part of the optimization problem. Each neuron uses the sigmoid activation function

f(x) = 1 / (1 + e^(-x))    (1)

To overcome the shortcomings of conventional back-propagation neural networks for imbalanced data set classification, the novel fitness function we propose combines the following information: (a) the root mean square error between the outputs of a neural network and the desired outputs; (b) the recognition error rate of each output neuron; and (c) the distances between the examples and the classification boundary (Figure 1).

Figure 1. The boundary of classification

The fitness function can be expressed as

fitness = 1 / (1 + (I_R + I_E + I_D))    (2)

where

I_R = sqrt( (1 / (M N)) * sum_{p=1..M} sum_{j=1..N} (T_j^p - Y_j^p)^2 )    (3)

I_E = sum_{j=1..N} N_F^j / N_j    (4)

I_D = (1 / N_H) * sum_{j=1..N_H} D1_j / D2_j    (5)

T_j^p denotes the desired output of output neuron j for training datum p, Y_j^p denotes the actual output of output neuron j for training datum p, M denotes the number of training data, N denotes the number of output neurons, N_F^j denotes the number of incorrectly classified examples belonging to output neuron j, N_j denotes the number of examples belonging to output neuron j, N_H denotes the number of nodes in the last hidden layer, D1_j denotes the average of the distances of the top 5% of training data that are nearest the classification boundary and belong to the same class, and D2_j denotes the average of the distances of the top 5% of training data that are near the boundary of classification and belong to the other class.
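Given the quantities defined above, the fitness value of one candidate network can be computed directly. The sketch below is a minimal illustration of Eqs. (2)-(5); the argument layout (nested lists for outputs, parallel lists for the per-neuron counts and distances) is an assumption for illustration, not the paper's data format.

```python
import math

def fitness(targets, outputs, n_incorrect, n_assigned, d1, d2):
    """Fitness of one candidate network, Eqs. (2)-(5).

    targets, outputs : M x N lists of desired / actual outputs.
    n_incorrect[j], n_assigned[j] : misclassified / total examples
        for output neuron j.
    d1[j], d2[j] : average boundary distances of the top 5% nearest
        examples of the same class / the other class.
    """
    M, N = len(targets), len(targets[0])
    # I_R: root mean square error over all outputs, Eq. (3)
    i_r = math.sqrt(sum((t - y) ** 2
                        for tp, yp in zip(targets, outputs)
                        for t, y in zip(tp, yp)) / (M * N))
    # I_E: per-neuron recognition error rate, Eq. (4)
    i_e = sum(f / n for f, n in zip(n_incorrect, n_assigned))
    # I_D: average ratio of boundary distances, Eq. (5)
    i_d = sum(a / b for a, b in zip(d1, d2)) / len(d1)
    # Eq. (2): fitness grows as the combined error terms shrink
    return 1.0 / (1.0 + i_r + i_e + i_d)
```

A perfect classifier with zero boundary-distance penalty reaches the maximum fitness of 1; every error term only pushes the value toward 0.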
4. Experimental results
4.1. Artificial data
The artificial data were used to test the classifier with the novel fitness function. In this experiment, we constructed 3922 examples, shown in Figure 4(a), which consist of two classes. The number of training examples was 2008; the remainder were used as test data. The ratio of the sizes of the two classes was 51.293.
(a) The distribution of artificial data (axes x1 and x2)
Figure 3. The structure of the neural network after training (the dotted lines and circle are inactive)

Table 1 presents the classification error rate for each class and for the total examples. Although the overall error rate of the back-propagation neural network (BPN) is better than that of the neural network with the fitness function we proposed, the error rate of class 1, which has fewer examples, is worse for the BPN. Comparing the error rates of class 1 and class 2 on the test data sets, the method we proposed performs in a more balanced way than the BPN. The structure of the neural network after training is shown in Figure 3. The dotted lines and circle are inactive, and the result shows that the variable x2 is not an important factor for this classifier. This result is also reflected in the classification boundary (see Figure 4).

Table 1. The classification error rate of artificial data
Back-propagation neural network versus the proposed method; training and test data sets.
(c) The classification boundary of the neural network with the fitness function we proposed
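The inactive links and node of Figure 3 can be represented by a binary mask over the weight matrix, so that pruning a link simply zeroes its contribution. The sketch below is a minimal illustration of one such masked layer; the names `weights`, `mask` and `bias` are hypothetical, and a GA chromosome would encode all three together.

```python
import math

def masked_forward(x, weights, mask, bias):
    """Forward pass of one fully connected layer with link pruning.

    weights[i][j] connects input i to neuron j; mask[i][j] = 0
    disables the link (a dotted link in Figure 3); bias[j] is the
    neuron's bias. Outputs pass through the sigmoid of Eq. (1).
    """
    out = []
    for j in range(len(bias)):
        s = bias[j] + sum(x[i] * weights[i][j] * mask[i][j]
                          for i in range(len(x)))
        out.append(1.0 / (1.0 + math.exp(-s)))  # sigmoid activation
    return out
```

When every incoming link of an input is masked out, as happened to x2 in the experiment above, that variable has no effect on the output, which is how the trained structure reveals unimportant inputs.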
4.2. Heart data

The results are shown in Table 2. The error rate of the method we proposed was more balanced than that of the BPN.

Table 2. The classification error rate of the Heart data set
Back-propagation neural network (I-H-O: 20-2-2) versus the proposed method; training and test data sets. I: the number of input variables. H: the number of nodes in the hidden layer. O: the number of nodes in the output layer.

4.3. Iris data

The iris data set [7] contains three classes of fifty examples each, where each class refers to a type of iris plant. Although the data set is not imbalanced, we also used it to verify the effect of the fitness function on the structure of the classifier. As shown in Table 3, the structure produced by the method we proposed was smaller than that of the BPN while achieving the same results.

Table 3. The classification error rate of the Iris data set
Back-propagation neural network (I-H1-H2-O: 4-8-3-3) versus the proposed method; training and test data sets. I: the number of input variables. H1: the number of nodes in the first hidden layer. H2: the number of nodes in the second hidden layer. O: the number of nodes in the output layer.

5. Conclusion

In this paper, we have proposed a novel fitness function in genetic algorithms to optimize neural networks for imbalanced data sets. Based on the experimental results, the following conclusions may be drawn:
1. The fitness function, which includes the classification error rate for each class and the distance information, performs well on imbalanced data sets.
2. The structures of the classifiers are smaller than that of the conventional back-propagation neural network while achieving the same results.

6. References

[1] J. Holland, Adaptation in Natural and Artificial Systems, MIT Press, 1975.
[2] M. D. d. Castillo and J. I. Serrano, "A multistrategy approach for digital text categorization from imbalanced documents," ACM SIGKDD Explorations Newsletter, special issue on learning from imbalanced datasets, 2004, pp. 70-79.
[3] L. Xu and M. Y. Chow, "A classification approach for power distribution systems fault cause identification," IEEE Trans. on Power Systems, vol. 21, no. 1, 2006, pp. 53-60.
[4] X. Hong, S. Chen, and C. J. Harris, "A kernel-based two-class classifier for imbalanced data sets," IEEE Trans. on Neural Networks, vol. 18, no. 1, 2007, pp. 28-41.
[5] F. H. F. Leung, H. K. Lam, S. H. Ling, and P. K. S. Tam, "Tuning of the structure and parameters of a neural network using an improved genetic algorithm," IEEE Trans. on Neural Networks, vol. 14, no. 1, 2003, pp. 79-88.
[6] J. D. Schaffer, D. Whitley, and L. J. Eshelman, "Combinations of genetic algorithms and neural networks: a survey of the state of the art," in Proc. Int. Workshop on Combinations of Genetic Algorithms and Neural Networks, 1992, pp. 1-37.
[7] UCI Machine Learning Repository, 2007, http://www.ics.uci.edu/~mlearn/MLRepository.html.