
Eighth International Conference on Intelligent Systems Design and Applications

A Novel Fitness Function in Genetic Algorithms to Optimize Neural Networks for Imbalanced Data Sets
Kuan-Chieh Huang, Yau-Hwang Kuo
Dept. of Computer Science and Information Engineering
National Cheng Kung University, Tainan, Taiwan
kchuang@ismp.csie.ncku.edu.tw, kuoyh@ismp.csie.ncku.edu.tw

I-Cheng Yeh
Dept. of Information Management
Chung Hua University, Hsin-chu, Taiwan
icyeh@chu.edu.tw

Abstract

Imbalanced data sets are often encountered in business, industrial, and real-life applications. In this paper, a novel fitness function in genetic algorithms to optimize neural networks is proposed for solving classification problems on imbalanced data sets. Not only the parameters of the neural networks but also the pruning of links between neurons are treated as an optimization problem in this study. The fitness function combines the mean square error, the classification error rate of each class, and the distances between the examples and the classification boundary. An artificial data set and UCI data sets are used to verify the proposed classifier. The experimental results show that the classifier performs better than the conventional back-propagation neural network.

1. Introduction
Genetic algorithms have been popular optimization and search techniques since John Holland introduced adaptation in natural and artificial systems [1]. The fitness function of a genetic algorithm plays an important role in defining the search space. The major advantages of genetic algorithms are that they require no gradient information and that they can search for globally near-optimal solutions. Imbalanced data sets are often encountered in business, industry, and real-life applications, and a number of studies have investigated imbalanced classification problems [2-4]. While over-sampling and under-sampling are common remedies for imbalanced classification, over-fitting and the loss of critical information are their usual side effects. The back-propagation neural network (BPN) has been widely used in classification problems because of its classification capability and performance. An iterative gradient algorithm is applied to minimize the mean square error between the outputs of the neural network and the desired outputs by optimizing the weights of the network. However, convergence to local optima and the choice of a suitable network structure are major disadvantages of neural networks. Several studies have suggested genetic algorithms to overcome these shortcomings [5-6]. Nevertheless, how to solve imbalanced classification problems effectively and efficiently is still an important issue. The purpose of this paper is to propose a novel fitness function in genetic algorithms to optimize the structures and weights of neural networks for imbalanced data sets. The novel fitness function combines the following information: (a) the mean square error between the outputs of the neural network and the desired outputs; (b) the recognition error rate of each output neuron; and (c) the distances between the examples and the classification boundary. Experimental results show that the proposed fitness function performs well on imbalanced classification problems.

2. The imbalanced classification problem


Two-class classification problems can be illustrated as in Figure 1 and Figure 2; the differences between them are the number of examples in each class and the degree of overlap. The root mean square error and the classification error rate are the learning criteria of a conventional BPN. In Figure 1, the classification boundaries shown are potential solutions that satisfy these criteria; generally speaking, the dotted line represents the minimal-risk boundary for the two-class problem. Imbalanced data sets with serious overlap, however, raise several problems: the learning criteria of the BPN prefer to assign the overlapping examples of the smaller class to the larger class (see Figure 2).
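A minimal numeric illustration of this bias (the class sizes here are made up for illustration, not taken from the paper):

# With a 980:20 imbalance, a classifier that assigns every example to the
# majority class looks excellent on the overall error rate ...
n_major, n_minor = 980, 20
total_error = n_minor / (n_major + n_minor)   # 0.02
# ... yet it misclassifies 100% of the minority class.
minor_error = 1.0
print(total_error, minor_error)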

Figure 1. The boundary of classification

Figure 2. The overlapping examples between the two classes

3. The classification method

3.1. The structure of neural networks

The structure of a feed-forward multi-layer perceptron (MLP) is applied in this study. The transfer function of the neurons is the sigmoid function shown in Equation (1):

f(x) = \frac{1}{1 + e^{-x}}    (1)

In a conventional MLP, the number of hidden layers and the number of nodes in each hidden layer are usually determined by the operator. Since these choices are closely related to the performance of the MLP, not only these parameters but also the pruning of links between neurons are treated as an optimization problem; one possible genetic encoding of these choices is sketched at the end of this section.

3.2. The novel fitness function in genetic algorithms

To overcome the shortcomings of conventional back-propagation neural networks on imbalanced data sets, the proposed fitness function combines the following information: (a) the root mean square error between the outputs of the neural network and the desired outputs; (b) the recognition error rate of each output neuron; and (c) the distances between the examples and the classification boundary. The fitness function can be expressed as

\text{fitness} = \frac{1}{1 + (I_R + I_E + I_D)}    (2)

where

I_R = \sqrt{\frac{1}{MN} \sum_{p=1}^{M} \sum_{j=1}^{N} (T_j^p - Y_j^p)^2}    (3)

I_E = \sum_{j=1}^{N} \frac{N_F^j}{N_j}    (4)

I_D = \sum_{j=1}^{N_H} \frac{1}{D_j^1 D_j^2}    (5)

Here T_j^p denotes the desired output of output neuron j for training example p, Y_j^p denotes the actual output of output neuron j for training example p, M denotes the number of training examples, N denotes the number of output neurons, N_F^j denotes the number of incorrectly classified examples belonging to output neuron j, N_j denotes the number of examples belonging to output neuron j, and N_H denotes the number of nodes in the last hidden layer. D_j^1 denotes the average distance to the classification boundary of the top 5% of training examples that are nearest to the boundary and belong to class 1, and D_j^2 denotes the corresponding average for the other class.
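To make the composite criterion concrete, the following is a minimal sketch of how Equations (2)-(5) could be evaluated for one candidate network. The array names, and the assumption that the boundary distances d1 and d2 are supplied per last-hidden-layer node, are ours rather than the authors':

import numpy as np

def fitness(T, Y, labels, preds, d1, d2):
    """Sketch of Equations (2)-(5).

    T, Y   : (M, N) arrays of desired and actual outputs of the N output neurons
    labels : (M,) true class index of each training example
    preds  : (M,) class index assigned by the network
    d1, d2 : (N_H,) average distances of the top-5% nearest examples of class 1,
             and of the other class, to the boundary formed by each node of the
             last hidden layer
    """
    M, N = T.shape

    # Eq. (3): root mean square error over all outputs.
    I_R = np.sqrt(np.sum((T - Y) ** 2) / (M * N))

    # Eq. (4): recognition error rate of each output neuron, summed over classes
    # (assumes every class occurs at least once in the training data).
    I_E = 0.0
    for j in range(N):
        members = labels == j
        N_j = np.sum(members)                  # examples belonging to neuron j
        N_F = np.sum(preds[members] != j)      # misclassified members of class j
        I_E += N_F / N_j

    # Eq. (5): small only when both classes stay far from every boundary node.
    I_D = np.sum(1.0 / (d1 * d2))

    # Eq. (2): higher fitness for a lower combined error.
    return 1.0 / (1.0 + I_R + I_E + I_D)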

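Returning to the structure search of Section 3.1, one possible genetic encoding, sketched under our own assumptions (fixed illustrative layer sizes, one dictionary per individual), pairs real-valued weight genes with a binary mask that prunes individual links:

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 2, 4, 2               # illustrative layer sizes

def random_individual():
    """One GA individual: real-valued weights plus a binary link mask."""
    return {
        "W1": rng.normal(size=(n_in, n_hid)),
        "W2": rng.normal(size=(n_hid, n_out)),
        "mask1": rng.integers(0, 2, size=(n_in, n_hid)),   # 0 prunes a link
        "mask2": rng.integers(0, 2, size=(n_hid, n_out)),
    }

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # Eq. (1)

def forward(ind, X):
    """Forward pass in which masked-out links contribute nothing."""
    H = sigmoid(X @ (ind["W1"] * ind["mask1"]))
    return sigmoid(H @ (ind["W2"] * ind["mask2"]))

Crossover and mutation can then act on the weight arrays and masks independently, so the genetic algorithm searches weights and topology at the same time.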

4. Experimental results
4.1. Artificial data
Artificial data were used to test the classifier with the novel fitness function. In this experiment, we constructed 3922 examples, as shown in Figure 4(a), consisting of two classes. The number of training examples was 2008; the remaining examples were used as test data. The ratio of the number of examples in the two classes was 51.293. The two classes were defined by

Class 1: (x_1 - 0.4)^2 + (x_2 - 0)^2 \le 0.1    (6)

Class 2: (x_1 + 0.3)^2 + (x_2 - 0)^2 \le 0.7    (7)
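Reading Equations (6) and (7) as disk-shaped regions (the relation signs were lost in reproduction, so "\le" is our assumption), the data set could be generated along these lines; the per-class counts are inferred from the stated total of 3922 examples and the 51.293 ratio:

import numpy as np

rng = np.random.default_rng(1)

def sample_disk(cx, r_squared, n):
    """Rejection-sample n points uniformly from (x1 - cx)^2 + x2^2 <= r_squared."""
    pts = []
    r = np.sqrt(r_squared)
    while len(pts) < n:
        x1 = rng.uniform(cx - r, cx + r)
        x2 = rng.uniform(-r, r)
        if (x1 - cx) ** 2 + x2 ** 2 <= r_squared:
            pts.append((x1, x2))
    return np.array(pts)

class1 = sample_disk( 0.4, 0.1, 75)     # Eq. (6): 3922 / (1 + 51.293) is about 75
class2 = sample_disk(-0.3, 0.7, 3847)   # Eq. (7): 75 * 51.293 is about 3847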
Figure 3. The structure of the neural network after training (the dotted lines and circle are inactive).

Figure 4. The artificial data and boundary: (a) the distribution of the artificial data; (b) the classification boundary of the back-propagation neural network; (c) the classification boundary of the neural network with the proposed fitness function.

Table 1 presents the classification error rate for each class and for all examples. Although the total error rate of the back-propagation neural network (BPN) is better than that of the neural network with the proposed fitness function, the BPN's error rate on class 1, the class with fewer examples, is far worse. Comparing the error rates of class 1 and class 2 on the test data, the proposed method performs in a much more balanced way than the BPN. The structure of the neural network after training is shown in Figure 3. The dotted lines and circle are inactive, and the result shows that the variable x2 is not an important factor for this classifier; this is also reflected in the classification boundary (see Figure 4).

Table 1. The classification error rate of artificial data

Back-propagation neural network
                      Total    Class 1   Class 2
  Training data sets  0.0085   0.4706    0.0005
  Test data sets      0.0100   0.4634    0.0000

A neural network with the fitness function we proposed
  Training data sets  0.0264   0.0000    0.0268
  Test data sets      0.0267   0.0244    0.0267
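The per-class and total error rates reported in Tables 1-3 can be computed with a generic helper such as this (our sketch, not the authors' code):

import numpy as np

def error_rates(labels, preds, n_classes):
    """Total error rate plus the fraction of each class that is misclassified."""
    total = np.mean(preds != labels)
    per_class = [np.mean(preds[labels == c] != c) for c in range(n_classes)]
    return total, per_class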

4.2. The UCI data sets

In the heart disease data set [7], we applied 20 variables as the inputs of the classifiers. The heart disease data set is composed of 270 examples divided into two groups (normal/abnormal); the number of normal examples is 150 and the number of abnormal examples is 120.

The results are shown in Table 2. The error rate of the proposed method is more balanced than that of the BPN.

Table 2. The classification error rate of the heart disease data set

Back-propagation neural network (I-H-O: 20-2-2)
                      Total    Normal   Abnormal
  Training data sets  0.0778   0.0408   0.1220
  Test data sets      0.1444   0.1346   0.1579

A neural network with the fitness function we proposed (I-H-O: 20-2-2)
  Training data sets  0.1222   0.1341   0.1122
  Test data sets      0.1222   0.1316   0.1154

I: the number of input variables. H: the number of nodes in the hidden layer. O: the number of nodes in the output layer.

The iris data set [7] contains three classes of fifty examples each, where each class refers to a type of iris plant. Although this data set is not imbalanced, we also used it to verify the effect of the fitness function on the structure of the classifier. As shown in Table 3, the proposed method needs a smaller structure than the BPN to achieve the same results.

Table 3. The classification error rate of the iris data set

Back-propagation neural network (I-H1-H2-O: 4-8-3-3)
                      Total    Class 1   Class 2   Class 3
  Training data sets  0.0267   0.0000    0.0800    0.0000
  Test data sets      0.0267   0.0000    0.0800    0.0000

A neural network with the fitness function we proposed (I-H1-O: 4-2-3)
  Training data sets  0.0267   0.0000    0.0800    0.0000
  Test data sets      0.0000   0.0000    0.0000    0.0000

I: the number of input variables. H1: the number of nodes in the first hidden layer. H2: the number of nodes in the second hidden layer. O: the number of nodes in the output layer.

5. Conclusion
In this paper, we have proposed a novel fitness function in genetic algorithms to optimize neural networks for imbalanced data sets. Based on the experimental results, the following conclusions may be drawn:
1. The fitness function, which includes the classification error rate of each class and the distance information, performs well on imbalanced data sets.
2. The structures of the resulting classifiers are smaller than that of the conventional back-propagation neural network while achieving the same results.

6. References

[1] J. Holland, Adaptation in Natural and Artificial Systems, MIT Press, 1975.
[2] M. D. del Castillo and J. I. Serrano, "A multistrategy approach for digital text categorization from imbalanced documents," ACM SIGKDD Explorations Newsletter, special issue on learning from imbalanced datasets, 2004, pp. 70-79.
[3] L. Xu and M. Y. Chow, "A classification approach for power distribution systems fault cause identification," IEEE Trans. on Power Systems, vol. 21, no. 1, 2006, pp. 53-60.
[4] X. Hong, S. Chen, and C. J. Harris, "A kernel-based two-class classifier for imbalanced data sets," IEEE Trans. on Neural Networks, vol. 18, no. 1, 2007, pp. 28-41.
[5] F. H. F. Leung, H. K. Lam, S. H. Ling, and P. K. S. Tam, "Tuning of the structure and parameters of a neural network using an improved genetic algorithm," IEEE Trans. on Neural Networks, vol. 14, no. 1, 2003, pp. 79-88.
[6] J. D. Schaffer, D. Whitley, and L. J. Eshelman, "Combinations of genetic algorithms and neural networks: A survey of the state of the art," in Proc. Int. Workshop on Combinations of Genetic Algorithms and Neural Networks, 1992, pp. 1-37.
[7] UCI Machine Learning Repository, 2007, http://www.ics.uci.edu/~mlearn/MLRepository.html.