
Expert Systems with Applications 38 (2011) 9014–9022. doi:10.1016/j.eswa.2011.01.120


A support vector machine classifier with rough set-based feature selection for breast cancer diagnosis

Hui-Ling Chen a,b, Bo Yang a,b, Jie Liu a,b, Da-You Liu a,b,*

a College of Computer Science and Technology, Jilin University, Changchun 130012, China
b Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
* Corresponding author at: College of Computer Science and Technology, Jilin University, Changchun 130012, China. E-mail address: liudy@jlu.edu.cn (D.-Y. Liu).

Article info

Keywords:
Breast cancer diagnosis
Rough set theory
Support vector machines
Feature selection

Abstract

Breast cancer is becoming a leading cause of death among women worldwide, and it is confirmed that early detection and accurate diagnosis of this disease can ensure long survival of the patients. Expert systems and machine learning techniques are gaining popularity in this field because of their effective classification and high diagnostic capability. In this paper, a rough set (RS) based support vector machine classifier (RS_SVM) is proposed for breast cancer diagnosis. In the proposed method (RS_SVM), an RS reduction algorithm is employed as a feature selection tool to remove redundant features and thereby improve the diagnostic accuracy of the SVM. The effectiveness of RS_SVM is examined on the Wisconsin Breast Cancer Dataset (WBCD) using classification accuracy, sensitivity, specificity, confusion matrices and receiver operating characteristic (ROC) curves. Experimental results demonstrate that the proposed RS_SVM not only achieves very high classification accuracy but also detects a combination of five informative features, which can give the physicians an important clue for breast cancer diagnosis.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction
Worldwide, breast cancer is by far the most common cancer amongst women, with an incidence rate more than twice that of colorectal cancer and cervical cancer and about three times that of lung cancer (http://www.en.wikipedia.org/wiki/Breast_cancer, last accessed September 2009). It is reported in (http://www.breastcancer.org/symptoms/understand_bc/what_is_bc.jsp, last accessed September 2009) that, for women in the US, breast cancer death rates are higher than those for any other cancer besides lung cancer, and that, besides skin cancer, breast cancer is the most commonly diagnosed cancer among US women. According to the World Health Organization, about one-third of the cancer burden could be decreased if cases were detected and treated early (http://www.who.int/mediacentre/factsheets/fs297/en/index.html, last accessed September 2009). The commonly used diagnostic techniques, such as mammography and fine needle aspiration cytology (FNAC), are reported to lack high diagnostic capability. Therefore, there is an absolute necessity to develop better diagnostic techniques. Owing to the above-mentioned needs, expert systems and machine learning techniques have been introduced to help improve the diagnostic capability. With the help of automatic diagnostic systems, the possible errors experts make in the course of diagnosis can be avoided, and the medical data can be examined in less time and in more detail.
This study aims to build an automatic diagnostic system to distinguish benign breast tumors from malignant ones. Compared with other classification techniques such as discriminant analysis, random forest methods and artificial neural networks, the support vector machine (SVM) has proven advantageous in handling classification tasks, with excellent generalization performance. SVM, a relatively new machine learning technique, was first introduced by Vapnik (1995); it seeks to minimize the upper bound of the generalization error based on the structural risk minimization (SRM) principle, which is known to yield high generalization performance. Another key feature of SVM is that training SVMs is equivalent to solving a linearly constrained quadratic programming problem, so the training is unlikely to be trapped in a local minimum (Cristianini & Shawe-Taylor, 2000). Recently, SVM has found application in a wide variety of fields, including handwritten digit recognition (Cortes & Vapnik, 1995), face detection in images (Osuna, Freund, & Girosi, 1997) and text categorization (Joachims et al., 1998). In this paper, we investigate the effectiveness of the SVM approach in conducting breast cancer diagnosis tasks.

When using SVM, we should bear in mind that the choice of the optimal input feature subset and of the optimal parameters plays a crucial role in building a prediction model with high prediction accuracy and stability. Both are important because the feature subset choice influences the appropriate kernel parameters and vice versa (Frohlich et al., 2003). Feature selection is an important issue in building classification systems; it seeks to identify the significant features and eliminate the irrelevant ones in order to build a good learning model. A technique that can reduce dimensionality using only the information contained within the data set, without any prior knowledge, while preserving the meaning of the original features, is therefore strongly desirable. RS theory can be utilized as such a tool to discover data dependencies and reduce the number of attributes in a data set by purely structural methods (Pawlak, 1982). The RS attribute reduction algorithm removes redundant features and selects a feature subset that has the same discernibility as the original set of features, leading to better prediction accuracy. From the medical point of view, this aims at identifying subsets of the most important attributes influencing the treatment of patients (benign or malignant).
In order to make full use of the advantages of RS theory in preprocessing the breast cancer data, and to further improve the classification accuracy of the SVM predictor model, RS_SVM is proposed in this work to diagnose breast cancer. The method consists of two stages. In the first stage, RS is employed as an attribute reduction tool to extract the optimal features, which eliminates unnecessary data. In the second stage, the optimal feature subset is used as the input to an SVM classifier with good generalization performance. The effectiveness and performance of the method are demonstrated on the WBCD. Experimental results have shown that RS_SVM achieves high predictive classification accuracy with much fewer attributes. It is observed that the proposed method achieved the highest classification accuracies (99.41%, 100% and 100% for the 50–50%, 70–30% and 80–20% training-test partitions, respectively) for a subset that contained five features.
The remainder of this paper is structured as follows: Section 2 summarizes the methods and results of previous works on breast cancer diagnosis. Section 3 offers brief background knowledge on rough set theory and SVM. The research design is described in Section 4. Section 5 presents the experimental results and a discussion of the proposed method. Finally, conclusions and recommendations for future work are summarized in Section 6.

2. Related work

A great number of techniques have been proposed to deal with the automated diagnosis of the breast cancer problem, and most of them achieved high classification accuracies. In Quinlan (1996), 10-fold cross-validation with the C4.5 decision tree method was used and the obtained classification accuracy was 94.74%. In Pena-Reyes and Sipper (1999), a fuzzy-GA method was employed and a classification accuracy of 97.36% was obtained. In Setiono (2000), the classification was based on a feedforward neural network rule extraction algorithm; the reported accuracy was 98.10%. In Goodman et al. (2002), three different methods, optimized learning vector quantization (LVQ), big LVQ, and the artificial immune recognition system (AIRS), were applied and the obtained accuracies were 96.7%, 96.8%, and 97.2%, respectively. In Abonyi and Szeifert (2003), an accuracy of 95.57% was obtained with the application of a supervised fuzzy clustering technique. In Hassanien (2004), the classification technique used the rough set method, reaching a classification accuracy of 98%. In Sahan et al. (2007), a new hybrid method based on a fuzzy-artificial immune system and the k-nn algorithm was used and the obtained accuracy was 99.14%. In Polat and Gunes (2007), least square SVM was used and an accuracy of 98.53% was obtained. In Ubeyli (2007), several methods (multilayer perceptron neural network, combined neural network, probabilistic neural network, recurrent neural network and SVM) were compared, and the highest classification accuracy of 97.36% was achieved by SVM. In Maglogiannis et al. (2009), three different methods, SVM, Bayesian classifiers and artificial neural networks, were applied and the obtained accuracies were 97.54%, 92.80% and 97.90%, respectively. And in Karabatak and Ince (2009), a method combining association rules and a neural network was utilized and an accuracy of 95.6% was obtained.
3. Theoretical background

3.1. Basic concepts of rough set theory

Rough set (RS) theory is an intelligent mathematical tool proposed by Pawlak (1982) to deal with uncertainty and incompleteness. It is based on the concepts of the upper and lower approximation of a set, the approximation space and models of sets. The main advantage of RS theory is that it does not need any preliminary or additional information about the data, such as probability distributions in statistics, basic probability assignments in Dempster-Shafer theory, or membership grades in fuzzy set theory. One of the major applications of RS theory is attribute reduction, that is, the elimination of redundant attributes. The reduction of attributes is achieved by comparing the equivalence relations generated by sets of attributes: using the dependency degree as a measure, attributes are removed so that the reduced set provides the same dependency degree as the original one. Here, we give only the main concepts of RS theory needed for understanding the rough set analysis done in this work; for a full description of rough set theory and related terms see Pawlak (1982, 1996).
3.1.1. Information system

Knowledge representation in rough sets is done via an information system, denoted as a 4-tuple $S = \langle U, A, V, f \rangle$, where $U$ is the closed universe, a finite set of $N$ objects $\{x_1, x_2, \ldots, x_N\}$; $A$ is a finite set of attributes $\{a_1, a_2, \ldots, a_n\}$, which can be further divided into two disjoint subsets $C$ and $D$, $A = C \cup D$, where $C$ is the set of condition attributes and $D$ is the set of decision attributes; $V = \bigcup_{a \in A} V_a$, where $V_a$ is the domain of the attribute $a$; and $f: U \times A \to V$ is the total decision function, called the information function, such that $f(x, a) \in V_a$ for every $a \in A$, $x \in U$.
3.1.2. Indiscernibility relation

One of the most significant aspects of RS theory is the indiscernibility relation. The R-indiscernibility relation, denoted $IND(R)$, is defined as:

$IND(R) = \{(x, y) \in U \times U : \forall a \in R, \ a(x) = a(y)\}$

where $a(x)$ denotes the value of attribute $a$ for object $x$. If $(x, y) \in IND(R)$, $x$ and $y$ are said to be indiscernible with respect to $R$. The equivalence classes of the R-indiscernibility relation are denoted by $[x]_R$. The indiscernibility relation is the mathematical basis of RS theory.
3.1.3. Lower and upper approximation

In RS theory, the lower and upper approximations are two basic operations. For any concept $X \subseteq U$ and attribute set $R \subseteq A$, $X$ can be approximated by its lower and upper approximations. The lower approximation of $X$ is the set of objects of $U$ that are surely in $X$, defined as:

$\underline{R}X = \{x \in U : [x]_R \subseteq X\}$

The upper approximation of $X$ is the set of objects of $U$ that are possibly in $X$, defined as:

$\overline{R}X = \{x \in U : [x]_R \cap X \neq \emptyset\}$

And the R-boundary region of $X$ is defined as:

$Bnd(X) = \overline{R}X - \underline{R}X$

A set is said to be rough if its boundary region is non-empty; otherwise the set is crisp.
3.1.4. Attribute reduction and core

In an information system there often exist condition attributes that do not provide any additional information about the objects in $U$. These redundant attributes can be eliminated without losing essential classificatory information (Kryszkiewicz & Rybinski, 1996). The reduct and core attribute sets are two fundamental concepts of rough set theory. A reduct is a minimal set of attributes from $A$ (the whole attribute set) that provides the same object classification as the full set of attributes; given the condition attributes $C \subseteq A$, a reduct is a minimal attribute set $R \subseteq C$ such that $IND(R) = IND(C)$. Let $RED(A)$ denote the set of all reducts of $A$. The intersection of all reducts of $A$ is referred to as the core of $A$, i.e., $CORE(A) = \bigcap RED(A)$; the core is common to all reducts.
3.1.5. Dependency degree

Various measures can be defined to represent how much $D$, the set of decision attributes, depends on $C$, the set of condition attributes. One of the most common measures is the dependency degree (Pawlak, 1997), denoted $\gamma_C(D)$ and defined as $\gamma_C(D) = |POS_C(D)|/|U|$, where $|U|$ is the cardinality of the set $U$ and $POS_C(D)$, called the positive region, is defined by $POS_C(D) = \bigcup_{X \in U/D} \underline{C}X$. Note that $0 \le \gamma_C(D) \le 1$. If $\gamma_C(D) = 1$, we say that $D$ depends totally on $C$; if $0 < \gamma_C(D) < 1$, we say that $D$ depends partially on $C$; and if $\gamma_C(D) = 0$, then $C$ and $D$ are totally independent of each other.
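
To make the concepts in Sections 3.1.2-3.1.5 concrete, the following minimal Python sketch (added for illustration, not part of the original study; the attribute names and values are made up) computes the equivalence classes of IND(R), the lower approximation, and the dependency degree for a toy decision table:

from collections import defaultdict

# Toy decision table: each row maps attribute name -> value;
# 'd' is the decision attribute, the rest are condition attributes.
table = [
    {'a1': 1, 'a2': 0, 'd': 'benign'},
    {'a1': 1, 'a2': 0, 'd': 'benign'},
    {'a1': 1, 'a2': 1, 'd': 'malignant'},
    {'a1': 0, 'a2': 1, 'd': 'malignant'},
    {'a1': 1, 'a2': 1, 'd': 'benign'},   # conflicts with object 2
]
U = set(range(len(table)))

def ind_classes(attrs):
    # Equivalence classes [x]_R of IND(R) for the attribute set R.
    classes = defaultdict(set)
    for x in U:
        classes[tuple(table[x][a] for a in attrs)].add(x)
    return list(classes.values())

def lower_approx(attrs, X):
    # R-lower approximation: union of the classes fully contained in X.
    return set().union(*(c for c in ind_classes(attrs) if c <= X))

def dependency_degree(C, D):
    # gamma_C(D) = |POS_C(D)| / |U|, where POS_C(D) is the union of the
    # C-lower approximations of the equivalence classes of IND(D).
    pos = set().union(*(lower_approx(C, X) for X in ind_classes(D)))
    return len(pos) / len(U)

print(dependency_degree(['a1', 'a2'], ['d']))  # 0.6: objects 2 and 4 conflict
print(dependency_degree(['a1'], ['d']))        # 0.2: a1 alone discerns less

A reduct search then amounts to looking for minimal attribute subsets whose dependency degree equals that of the full condition set.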

3.2. Support vector machines for classification

Support vector machines (SVMs) were originally developed by Boser et al. (1992) and Vapnik (1995). Based on Vapnik-Chervonenkis (VC) theory and the structural risk minimization (SRM) principle (Vapnik, 1995, 1998), SVM tries to find the trade-off between minimizing the training set error and maximizing the margin, in order to achieve the best generalization ability while remaining resistant to overfitting. In addition, one major advantage of SVM is its use of convex quadratic programming, which provides only global minima and hence avoids being trapped in local minima. For more details, refer to Cristianini and Shawe-Taylor (2000) and Vapnik (1995), which give a complete description of SVM theory. In this section we concentrate on the basic SVM concepts for typical binary classification problems.

3.2.1. Linearly separable case: hard-margin SVM

Let us consider a binary classification task $\{x_i, y_i\}$, $i = 1, \ldots, l$, $y_i \in \{-1, +1\}$, $x_i \in \mathbb{R}^d$, where the $x_i$ are data points and the $y_i$ are the corresponding labels. They are separated by a hyperplane given by $w^T x + b = 0$, where $w$ is a coefficient vector normal to the hyperplane and $b$ is the offset from the origin. There are many hyperplanes that can separate the two classes, but the decision boundary should be as far away from the data of both classes as possible; the support vector algorithm therefore looks for the optimal separating hyperplane that maximizes the separating margin between the two classes of data, since a wider margin yields better generalization ability. We can define a canonical hyperplane (Vapnik, 1995) such that $H_1: w^T x^+ + b = +1$ for the closest points on one side and $H_2: w^T x^- + b = -1$ for the closest points on the other. Maximizing the separating margin is then equivalent to maximizing the distance between the hyperplanes $H_1$ and $H_2$, whose width is

$m = (x^+ - x^-) \cdot \frac{w}{\|w\|} = \frac{2}{\|w\|}$

To maximize the margin, the task is therefore:

Minimize $g(w) = \frac{1}{2}\|w\|^2$

so that:

$y_i(w^T x_i + b) \ge 1, \quad \forall i$

The learning task can then be reduced to the minimization of the primal Lagrangian:

Minimize $L_p(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n}\alpha_i\left[y_i(w^T x_i + b) - 1\right]$

where the $\alpha_i$ are Lagrange multipliers, hence $\alpha_i \ge 0$. The minimum of the Lagrangian $L_p$ with respect to $b$ and $w$ is given by

$\frac{\partial L_p}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n}\alpha_i y_i = 0, \qquad \frac{\partial L_p}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{n}\alpha_i y_i x_i$

Substituting $b$ and $w$ back into the primal gives the Wolfe dual Lagrangian:

Maximize $\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i^T x_j$  (10)

so that:

$\sum_{i=1}^{n}\alpha_i y_i = 0, \quad \alpha_i \ge 0$  (11)

Obviously, this is a quadratic optimization problem (QP) with linear constraints. From the Karush-Kuhn-Tucker (KKT) conditions we know that $\alpha_i(y_i(w^T x_i + b) - 1) = 0$; thus, only the support vectors have $\alpha_i \neq 0$, and they carry all the relevant information about the classification problem. Hence the solution has the form $w = \sum_{i=1}^{n}\alpha_i y_i x_i = \sum_{i \in SV}\alpha_i y_i x_i$, where $SV$ is the set of support vectors, and $b$ is obtained from $y_i(w^T x_i + b) - 1 = 0$ for any support vector $x_i$. Therefore, the linear discriminant function takes the form:

$g(x) = w^T x + b = \sum_{i \in SV}\alpha_i y_i x_i^T x + b$
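
As an illustrative check of the hard-margin solution above (a sketch added for illustration, not from the original paper; scikit-learn is assumed here in place of the authors' LIBSVM setup), the code below fits a linear SVM on separable toy data and reconstructs $w = \sum_{i \in SV}\alpha_i y_i x_i$ and $g(x)$ from the dual coefficients; a very large C approximates the hard-margin case:

import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data with labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A huge C approximates the hard-margin SVM of Section 3.2.1.
clf = SVC(kernel='linear', C=1e10).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only,
# so w = sum over SV of alpha_i y_i x_i:
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
b = clf.intercept_[0]

# g(x) = w^T x + b reproduces the classifier's decision function.
print(np.allclose(X @ w + b, clf.decision_function(X)))  # True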
3.2.2. Linearly non-separable case: soft-margin SVM

In practice the data are always subject to noise or outliers, so the two classes can rarely be separated perfectly. In order to extend the support vector algorithm to handle such imperfect separation, positive slack variables $\xi_i$, $i = 1, \ldots, l$ (Cortes & Vapnik, 1995) are introduced to allow misclassification of noisy data points, and a penalty value $C$ is introduced for the points that cross the boundaries, to take the misclassification errors into account. In fact, the parameter $C$ can be viewed as a way to control overfitting.
Hence the new optimization problem can be reformulated as follows:

Minimize $g(w, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$  (12)

so that:

$y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$  (13)

Translating this problem into its Lagrangian dual form gives:


Maximize $\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i^T x_j$  (14)

so that:

$0 \le \alpha_i \le C, \quad \sum_{i=1}^{n}\alpha_i y_i = 0$  (15)

The solution to this minimization problem is identical to that of the separable case except for the upper bound $C$ on the Lagrange multipliers $\alpha_i$.

Table 1
The details of the nine attributes of the breast cancer data.

Label   Attribute                     Domain
C1      Clump Thickness               1–10
C2      Uniformity of Cell Size       1–10
C3      Uniformity of Cell Shape      1–10
C4      Marginal Adhesion             1–10
C5      Single Epithelial Cell Size   1–10
C6      Bare Nuclei                   1–10
C7      Bland Chromatin               1–10
C8      Normal Nucleoli               1–10
C9      Mitoses                       1–10

3.2.3. Non-linearly separable case: the kernel trick

In most cases the two classes cannot be linearly separated. In order to extend the linear learning machine to non-linear cases, a general idea is introduced: the original input space is mapped into some higher-dimensional feature space where the training set is separable. With this mapping, the discriminant function is of the following form:

$g(x) = w^T\phi(x) + b = \sum_{i \in SV}\alpha_i y_i \phi(x_i)^T\phi(x) + b$  (16)

where $x_i^T x$ in the input space is represented in the form $\phi(x_i)^T\phi(x)$ in the feature space. The functional form of the mapping $\phi(\cdot)$ does not need to be known, since it is implicitly defined by the choice of kernel $K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$. Thus, the optimization problem can be rewritten as:

Maximize $\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j)$  (17)

so that:

$0 \le \alpha_i \le C, \quad \sum_{i=1}^{n}\alpha_i y_i = 0$  (18)

After the optimal values of $\alpha_i$ have been found, the decision is based on the sign of:

$g(x) = \sum_{i \in SV}\alpha_i y_i K(x_i, x) + b$  (19)

Fig. 1. The scatter plot of the breast cancer data using Fisher discriminant analysis.

In general, any positive semi-definite function $k(x, y)$ that satisfies Mercer's condition can be a kernel function (Scholkopf & Smola, 2002). A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space. Many kernel functions can be used in SVM; the most commonly used include the polynomial kernel:

$K(x_i, x_j) = (1 + x_i^T x_j)^p$  (20)

where $p$ is the polynomial order, and the Gaussian (radial basis function (RBF)) kernel:

$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$  (21)

where $\sigma$ is the parameter controlling the width of the Gaussian kernel.
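
To tie the soft-margin problem (12)-(15) and the RBF kernel (21) together, the sketch below (added for illustration; scikit-learn is assumed, and its gamma parameter corresponds to $1/(2\sigma^2)$ in Eq. (21)) trains a kernel SVM on non-linearly separable toy data and shows how shrinking C admits more margin violations:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable toy data standing in for the real features.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Soft-margin SVM with the Gaussian (RBF) kernel of Eq. (21).
clf = SVC(kernel='rbf', C=2.0, gamma=0.5).fit(X_tr, y_tr)
print('test accuracy: %.4f' % clf.score(X_te, y_te))

# A much smaller C tolerates more violations (larger slacks xi_i in
# Eq. (12)), which typically leaves more support vectors on the margin.
soft = SVC(kernel='rbf', C=2.0 ** -5, gamma=0.5).fit(X_tr, y_tr)
print('support vectors: %d vs %d' % (len(clf.support_), len(soft.support_)))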
4. Methodology and experiments

4.1. Data collection and description

Breast cancer is an uncontrolled growth of breast cells. A tumor can be benign (not dangerous to health) or malignant (has the potential to be dangerous). Benign tumors are not considered cancerous: their cells are close to normal in appearance, they grow slowly, and they do not invade nearby tissues or spread to other parts of the body. Malignant tumors are cancerous; left unchecked, malignant cells eventually can spread beyond the original tumor to other parts of the body. The term "breast cancer" refers to a malignant tumor that has developed from cells in the breast (http://www.breastcancer.org/symptoms/understand_bc/what_is_bc.jsp, last accessed September 2009).

In this study, we have performed our experiments on the WBCD taken from the UCI machine learning repository (UCI Repository of Machine Learning Databases). The dataset contains 699 instances taken from needle aspirates from patients' breasts, of which 16 instances have missing values. Because of the small number of missing data, these cases are discarded from the data set and the remaining 683 cases are used in our experiments, of which 444 cases belong to the benign class and the remaining 239 cases to the malignant class. Each record in the database has nine attributes, which were found to differ significantly between benign and malignant samples. The nine attributes listed in Table 1 are graded 1–10, with 10 being the most abnormal state. The class attribute is represented as 2 for benign and 4 for malignant cases.

In order to visualize the breast cancer data, the whole data set is projected into two dimensions using Fisher discriminant analysis, as shown in Fig. 1.
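
The preprocessing just described can be reproduced with a short script. The sketch below (added for illustration, not the authors' code) assumes the UCI file breast-cancer-wisconsin.data is available locally, with '?' marking the missing values; it drops the 16 incomplete records and maps the class labels:

import pandas as pd

cols = ['id', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'class']
df = pd.read_csv('breast-cancer-wisconsin.data', names=cols, na_values='?')

df = df.dropna()                # 699 -> 683 records (16 had missing values)
X = df[cols[1:10]].astype(int)  # the nine attributes C1..C9, graded 1-10
y = df['class'].map({2: 'benign', 4: 'malignant'})

print(len(df), y.value_counts().to_dict())
# 683 {'benign': 444, 'malignant': 239}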

4.2. RS-based feature selection

To find attribute reducts, our experiments have been executed by means of ROSETTA (http://www.lcb.uu.se/tools/rosetta/), a rough set theory-based application developed by the Norwegian University of Science and Technology and Warsaw University,
and the reduct sets are determined by a genetic algorithm (GA). The reduction is performed on the information table (the breast cancer data set) by the GA-based reduction algorithm, and 20 subsets of attributes are obtained in the end. Since this number of reducts is too high for further analysis, a few subsets of attributes or features have to be chosen. Therefore, we have computed the correlation of each condition attribute with the decision attribute (benign or malignant). The results show that the attribute C6 (Bare Nuclei) obtains the highest value and C9 (Mitoses) the lowest one; that is, C6 has the strongest and C9 the weakest relevancy to the decision attribute. According to John, Kohavi, et al. (1994), for some predictor designs feature relevancy (even strong relevancy) does not imply that a feature must be in an optimal feature subset; sometimes a weakly relevant feature may improve prediction accuracy. Hence, we adopt a combination filtering strategy to find the optimal feature subsets, that is, we choose the subsets of attributes which contain both C6, with the strongest relevancy, and C9, with the weakest relevancy. According to this approach, 13 subsets of attributes are disregarded and the remaining 7 subsets are retained as the inputs of the SVM classifier; a minimal sketch of this filtering step follows Table 2.

Table 2
Training set and testing set.

Training-test partition (%)   Training set   Testing set
50–50                         341            342
70–30                         478            205
80–20                         546            137
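
The following sketch expresses the filtering step (added for illustration; the two reducts shown are entries 1 and 9 of Table 4 written by attribute label, and pandas' pointwise correlation stands in for whatever coefficient was actually used):

# Continuing from the loading sketch in Section 4.1: X holds C1..C9, y the class.
corr = X.corrwith((y == 'malignant').astype(int)).sort_values()
strongest, weakest = corr.index[-1], corr.index[0]  # C6 and C9 in the paper

# The 20 reducts of Table 4, expressed by attribute label (two shown here).
reducts = [{'C1', 'C5', 'C6', 'C8'},
           {'C2', 'C5', 'C6', 'C7', 'C9'}]  # ... 18 more
retained = [r for r in reducts if strongest in r and weakest in r]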

4.3. Setting model parameters

In addition to feature selection, proper model parameter settings can improve the SVM classification accuracy, so the values of the SVM parameters have to be chosen carefully in advance. These parameters include the following: (1) the regularization parameter C, which determines the trade-off between minimizing the training error and the complexity of the model; (2) the parameter gamma (γ) of the kernel function, which defines the non-linear mapping from the input space to some high-dimensional feature space; and (3) the kernel function used in SVM, which constructs a non-linear decision hyperplane in the input space. This investigation only considers the Gaussian kernel, whose width is controlled by γ. In this work, we employ a grid-search technique (Hsu, Chang, et al., 2003) using 5-fold cross-validation to find the optimal parameter values of the RBF kernel function.

In order to ensure the same class distribution in the subsets, stratified sampling is employed on the 7 reduced feature subsets obtained in the first stage by the RS-based feature selection: using stratified sampling, each reduced subset is split into three training-test partitions, namely 80–20%, 70–30% and 50–50%, respectively. The details of the division are represented in Table 2. Before building the classifier, the data sets are scaled: with the training and testing data together, we scale each feature to [-1, 1].

Fig. 2. Overall procedure of RS_SVM modeling: the optimal feature subsets are selected from the WBCD feature space by the rough set attribute reduction algorithm, split into 50–50%, 70–30% and 80–20% training-test partitions by stratified sampling, the SVM is trained with a grid search on (C, γ) under 5-fold cross-validation until the termination criteria are met, and the resulting optimal predictor model is used to predict the labels of the corresponding test subsets.


Table 3
Confusion matrix.

                  Predicted positive   Predicted negative
Actual positive   TP                   FN
Actual negative   FP                   TN

We then perform the 5-fold cross-validation on the 80%, 70% and 50% training sets to choose the proper parameters from $C \in \{2^{-5}, 2^{-3}, \ldots, 2^{15}\}$ and $\gamma \in \{2^{-15}, 2^{-13}, \ldots, 2^{1}\}$, respectively. Pairs of $(C, \gamma)$ are tried, and the one with the best cross-validation accuracy is chosen as the parameter setting of the RBF kernel. The best parameter pair $(C, \gamma)$ is then used to create the model for training, and after obtaining the predictor model we conduct the prediction on each testing set accordingly. Since the subsets are split randomly, every random training-test partition is likely to yield a different best parameter pair and a different 5-fold cross-validation rate. Therefore, in order to make the observations more convincing, we conduct 1000 independent runs of the experiments for each training-test partition, and both the highest and the average classification accuracies are computed. Meanwhile, to make sure that the results are not biased, we also compute the frequency, namely the number of times the highest classification accuracy occurs in the 1000 runs. The implementation is carried out via the LIBSVM software originally designed by Chang and Lin (2001), and the evaluation is performed on an Intel Quad-Core Xeon 5130 CPU (2.0 GHz) with 4 GB of RAM. A condensed single-run version of this protocol is sketched below.
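
The sketch assumes scikit-learn's GridSearchCV in place of the authors' LIBSVM scripts, and a single run stands in for the 1000 repetitions:

import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Subset #5 of Table 5: C1, C3, C4, C6, C9 (X and y from Section 4.1).
X5 = X[['C1', 'C3', 'C4', 'C6', 'C9']].to_numpy()
y01 = (y == 'malignant').astype(int).to_numpy()

# Scale each feature to [-1, 1] with training and testing data together.
X5 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X5)

# One stratified 70-30% training-test partition.
X_tr, X_te, y_tr, y_te = train_test_split(X5, y01, test_size=0.3,
                                          stratify=y01)

# Grid search with 5-fold cross-validation over the exponential grid.
grid = {'C': 2.0 ** np.arange(-5, 16, 2),       # 2^-5, 2^-3, ..., 2^15
        'gamma': 2.0 ** np.arange(-15, 2, 2)}   # 2^-15, 2^-13, ..., 2^1
search = GridSearchCV(SVC(kernel='rbf'), grid, cv=5).fit(X_tr, y_tr)

print('best (C, gamma):', search.best_params_)
print('test accuracy: %.4f' % search.best_estimator_.score(X_te, y_te))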


4.4. The architecture of the RS_SVM classifier

The architecture of RS_SVM, which combines feature selection and parameter optimization, is shown in Fig. 2. The feature selection is done by the RS reduction algorithm, and the parameter optimization is performed by grid search. At the beginning, the RS reduction algorithm is executed on the whole breast cancer data set and 20 reduced subsets are obtained. Since this number of reduced subsets is too big, we select a few of them according to the combination filtering strategy, i.e., we choose the subsets which contain the attributes with both the highest and the lowest relevancy to the decision feature; in the end, just 7 subsets are retained. In the second phase, each of the remaining 7 subsets is split into three training-test partitions, 80–20%, 70–30% and 50–50%, through stratified sampling. In the third phase, a grid search using 5-fold cross-validation is carried out on each training set to find the optimal parameter pair (C, γ). In the fourth phase, the subset is trained with the obtained optimal parameter pair (C, γ) to get a predictor model. In the last phase, the obtained predictor model is used to predict the instances in each testing set.

4.5. Measures for performance evaluation

In order to evaluate the prediction performance of the RS_SVM classifier, we define and compute the classification accuracy, sensitivity, specificity and ROC curves, respectively. The formulations are as follows:

$Accuracy = \frac{TP + TN}{TP + FP + FN + TN} \times 100\%$  (22)

$Sensitivity = \frac{TP}{TP + FN} \times 100\%$  (23)

$Specificity = \frac{TN}{FP + TN} \times 100\%$  (24)
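
Eqs. (22)-(24) translate directly into code. The sketch below (added for illustration) derives the three measures from predicted labels via scikit-learn's confusion matrix helper, with benign taken as the positive class as in the paper; the example labels reproduce the 50-50% row of Table 9:

from sklearn.metrics import confusion_matrix

def diagnostic_metrics(y_true, y_pred):
    # Positive class is encoded as 1 (benign in the paper), negative as 0.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy    = (tp + tn) / (tp + fp + fn + tn) * 100  # Eq. (22)
    sensitivity = tp / (tp + fn) * 100                   # Eq. (23)
    specificity = tn / (fp + tn) * 100                   # Eq. (24)
    return accuracy, sensitivity, specificity

# 50-50% partition of Table 9: 221 benign hit, 2 missed, all 119 malignant hit.
y_true = [1] * 223 + [0] * 119
y_pred = [1] * 221 + [0] * 2 + [0] * 119
print(diagnostic_metrics(y_true, y_pred))  # approx. (99.42, 99.10, 100.0)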

Table 4
Attribute sets identified by RS.

Attribute set number   Attribute sets
1    {Clump Thickness, Single Epithelial Cell Size, Bare Nuclei, Normal Nucleoli}
2    {Clump Thickness, Uniformity of Cell Shape, Single Epithelial Cell Size, Bare Nuclei}
3    {Clump Thickness, Uniformity of Cell Size, Bare Nuclei, Normal Nucleoli}
4    {Uniformity of Cell Shape, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin}
5    {Uniformity of Cell Shape, Marginal Adhesion, Bare Nuclei, Normal Nucleoli}
6    {Clump Thickness, Marginal Adhesion, Bare Nuclei, Bland Chromatin}
7    {Clump Thickness, Uniformity of Cell Shape, Bare Nuclei, Normal Nucleoli}
8    {Clump Thickness, Uniformity of Cell Size, Bare Nuclei, Bland Chromatin}
9    {Uniformity of Cell Size, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Mitoses}
10   {Uniformity of Cell Size, Single Epithelial Cell Size, Bare Nuclei, Normal Nucleoli, Mitoses}
11   {Uniformity of Cell Size, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli}
12   {Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Bare Nuclei}
13   {Uniformity of Cell Shape, Marginal Adhesion, Bare Nuclei, Bland Chromatin, Mitoses}
14   {Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Bare Nuclei, Bland Chromatin}
15   {Uniformity of Cell Size, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Normal Nucleoli}
16   {Clump Thickness, Uniformity of Cell Size, Single Epithelial Cell Size, Bare Nuclei, Mitoses}
17   {Uniformity of Cell Size, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin}
18   {Clump Thickness, Uniformity of Cell Shape, Marginal Adhesion, Bare Nuclei, Mitoses}
19   {Clump Thickness, Uniformity of Cell Size, Marginal Adhesion, Bare Nuclei, Mitoses}
20   {Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses}

Table 5
The remaining 7 attribute sets.

Attribute set number   Attribute sets
1    {Uniformity of Cell Size, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Mitoses}
2    {Uniformity of Cell Size, Single Epithelial Cell Size, Bare Nuclei, Normal Nucleoli, Mitoses}
3    {Uniformity of Cell Shape, Marginal Adhesion, Bare Nuclei, Bland Chromatin, Mitoses}
4    {Clump Thickness, Uniformity of Cell Size, Single Epithelial Cell Size, Bare Nuclei, Mitoses}
5    {Clump Thickness, Uniformity of Cell Shape, Marginal Adhesion, Bare Nuclei, Mitoses}
6    {Clump Thickness, Uniformity of Cell Size, Marginal Adhesion, Bare Nuclei, Mitoses}
7    {Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses}


Table 6
Classification accuracies for each subset and different testing sets.

         50–50% training-test       70–30% training-test       80–20% training-test
Subset   Highest (freq)   Average   Highest (freq)   Average   Highest (freq)   Average
#1       98.24 (1)        95.55     99.51 (1)        95.81     100 (1)          95.96
#2       97.95 (9)        95.71     99.02 (3)        95.90     100 (5)          95.90
#3       99.12 (1)        95.99     99.51 (2)        96.27     100 (7)          96.49
#4       98.53 (1)        95.96     99.51 (1)        96.06     100 (1)          96.06
#5       99.41 (1)        96.55     100 (1)          96.72     100 (14)         96.87
#6       98.24 (5)        96.14     99.02 (5)        96.14     100 (3)          96.05
#7       97.66 (5)        95.36     99.51 (1)        95.51     99.27 (11)       95.64

The highest and average classification accuracies of subset #5 are shown in bold.

Table 7
The best parameter pair (C, γ) and the cross-validation rate k of subset #5.

Partition   C     γ     k (%)
50–50%      2^1   2^1   95.3079
70–30%      2^1   2^1   96.0251
80–20%      2^3   2^1   96.8864

Table 8
Sensitivity and specificity for subset #5.

Metric            50–50% training-test partition   70–30% training-test partition   80–20% training-test partition
Sensitivity (%)   99.10                            100                              100
Specificity (%)   100                              100                              100

In Eqs. (22)–(24), TP is the number of true positives (benign breast tumors); FN, the number of false negatives (malignant breast tumors); TN, the number of true negatives; and FP, the number of false positives. They are defined by the confusion matrix in Table 3.

The ROC curve plots the true positive rate as a function of the false positive rate, parameterized by the probability threshold values. The true positive rate represents the fraction of positive cases that are correctly classified by the model; the false positive rate represents the fraction of negative cases that are incorrectly classified as positive. The curve therefore describes the trade-off between sensitivity and specificity.

5. Experimental results and discussion

5.1. Experimental results

To evaluate the effectiveness of the proposed method, we conduct experiments on the Wisconsin breast cancer database. Significant feature subsets are chosen by the RS reduction algorithm, and the SVM parameters are optimized by grid search. The 20 subsets of attributes chosen by the RS reduction algorithm are shown in Table 4, and the 7 subsets chosen by the combination filtering strategy are shown in Table 5.

The classification accuracies on the testing data for the 7 subsets are shown in Table 6. Among the 7 subsets, subset #5 achieves the highest classification accuracy, namely 99.41% for the 50–50% training-test partition, 100% for the 70–30% training-test partition, and 100% for the 80–20% training-test partition. The best parameter pairs (C, γ) and the cross-validation rates of subset #5 on each training-test partition are presented in detail in Table 7.

We present the values of sensitivity and specificity for subset #5 in Table 8. The ROC curves for subset #5 are also presented (Figs. 3–5). The areas under the ROC curves (AUC) are computed, and these values can be used for evaluating the classifier performance on the different training-test partitions; a larger area means better classifier performance.

The classification results are displayed using confusion matrices in Table 9. As we can see from Table 9, the sum of false positives and false negatives decreases as the training set size increases. In particular, there are no false positives or false negatives for the 70–30% and 80–20% training-test partitions.

Fig. 3. ROC curve for the 50–50% training-test partition.

5.2. Comparative study

In this experiment, we compare the five features identified by RS_SVM with the top-ranked five features which have the strongest relevancy to the decision attribute. In addition, the whole set of nine features is used as a benchmark as well. The correlation of each of the nine condition attributes with the decision attribute has been computed. As shown in Table 10, C6 (Bare Nuclei), C3 (Uniformity of Cell Shape), C2 (Uniformity of Cell Size), C7 (Bland Chromatin) and C8 (Normal Nucleoli) are the top-ranked five features showing strong relevancy to the decision attribute. Therefore, C2, C3, C6, C7 and C8 are chosen as the input of the SVM model. Table 11 shows the highest and average classification accuracies achieved with these five features by the SVM training/prediction procedure; the frequency of the highest classification accuracy in 1000 runs is given as well. Meanwhile, the detailed classification accuracy of the nine features with the SVM model is shown in Table 12, and the comparison of the classification accuracies of the three feature subsets is shown in Table 13.



Table 9
Confusion matrices for subset #5.

Partition                        Actual      Predicted benign   Predicted malignant
50–50% training-test partition   Benign      221                2
                                 Malignant   0                  119
70–30% training-test partition   Benign      134                0
                                 Malignant   0                  71
80–20% training-test partition   Benign      90                 0
                                 Malignant   0                  47

Fig. 4. ROC curve for the 70–30% training-test partition.

Table 10
The correlation between the condition attributes and the decision attribute.

Condition attribute   Correlation with decision attribute
C1                    0.71479
C2                    0.820801
C3                    0.821891
C4                    0.706294
C5                    0.690958
C6                    0.822696
C7                    0.758228
C8                    0.718677
C9                    0.423448

Table 11
Classification accuracies for the top five relevant features on different partitions.

                       50–50%              70–30%              80–20%
Subset                 Max (freq)   Avg    Max (freq)   Avg    Max (freq)   Avg
{C2, C3, C6, C7, C8}   98.53 (3)    95.90  99.51 (2)    96.11  100 (1)      96.13

Max, freq and Avg represent highest, frequency and average, respectively.

Fig. 5. ROC curve for the 80–20% training-test partition.

Table 12
Classification accuracies for the nine features on different partitions.

                                       50–50%              70–30%              80–20%
Subset                                 Max (freq)   Avg    Max (freq)   Avg    Max (freq)   Avg
{C1, C2, C3, C4, C5, C6, C7, C8, C9}   98.53 (4)    96.54  99.51 (2)    96.59  100 (2)      96.54

Max, freq and Avg represent highest, frequency and average, respectively.

Table 13
Comparison of the classification accuracies of three feature subsets.

                                       50–50%              70–30%              80–20%
Subset                                 Max (freq)   Avg    Max (freq)   Avg    Max (freq)   Avg
{C1, C3, C4, C6, C9}                   99.41 (1)    96.55  100 (1)      96.72  100 (14)     96.87
{C2, C3, C6, C7, C8}                   98.53 (3)    95.90  99.51 (2)    96.11  100 (1)      96.13
{C1, C2, C3, C4, C5, C6, C7, C8, C9}   98.53 (4)    96.54  99.51 (2)    96.59  100 (2)      96.54

Max, freq and Avg represent highest, frequency and average, respectively; the highest values among the three subsets are shown in bold.


As can be seen from Table 13, the five features obtained by RS_SVM, namely C1, C3, C4, C6 and C9, performed best in terms of both the highest and the average classification accuracy. These five features are thus shown to be the most informative ones for classifying breast cancer. This suggests an important clue for physicians: to pay particular attention to these five features, namely Clump Thickness, Uniformity of Cell Shape, Marginal Adhesion, Bare Nuclei and Mitoses, in breast cancer diagnosis. We believe the proposed expert system can be very helpful in assisting physicians to make accurate diagnoses and shows great potential in the area of clinical diagnosis.
6. Conclusion and future work

This work has explored a new expert system, RS_SVM, for breast cancer diagnosis. Experiments on different portions of the WBCD demonstrated that RS_SVM performs well in distinguishing benign breast tumors from malignant ones. It was observed that the proposed method achieved the highest classification accuracies (99.41%, 100%, and 100% for the 50–50%, 70–30% and 80–20% training-test partitions, respectively) for a subset that contained five features (subset #5). Meanwhile, a comparative experiment was conducted against the top-ranked five relevant features and the whole set of nine features; the results showed that the five features identified by RS_SVM outperformed the other two feature subsets in terms of both the highest and the average classification accuracy. In addition, the combination of five features (i.e., Clump Thickness, Uniformity of Cell Shape, Marginal Adhesion, Bare Nuclei and Mitoses) identified by the RS-based reduction algorithm was found to be the most informative for classifying breast tumors, implying that these five features deserve close attention from physicians when conducting the diagnosis.

We believe the promising results demonstrated by the proposed method (RS_SVM) in classifying breast cancer can help physicians make very accurate diagnostic decisions. Future investigations will pay attention to evaluating the proposed RS_SVM on other, larger breast cancer datasets. In addition, since the performance of SVM depends greatly on the model parameters, developing a more efficient approach to identify the optimal model parameters should also be examined in our future work.
Acknowledgements

This research is supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 60603030, 60873149, 60973088, 60773099 and 60703022, and by the National High-Tech Research and Development Plan of China under Grant Nos. 2006AA10Z245 and 2006AA10A309. This work is also supported by the Open Projects of the Shanghai Key Laboratory of Intelligent Information Processing at Fudan University under Grant No. IIPL-09-007, and by the Erasmus Mundus External Cooperation Window Project (EMECW): Bridging the Gap, under Grant No. 155776-EM-1-2009-1-IT-ERAMUNDUS-ECW-L12, 2009-2012.

References

Abonyi, J., & Szeifert, F. (2003). Supervised fuzzy clustering for the identification of fuzzy classifiers. Pattern Recognition Letters, 24(14), 2195–2207.
Boser, B. E., Guyon, I. M., et al. (1992). A training algorithm for optimal margin classifiers. In Fifth annual workshop on computational learning theory. Pittsburgh: ACM.
Chang, C. C., & Lin, C. J. (2001). LIBSVM: A library for support vector machines. Software available at www.csie.ntu.edu.tw/~cjlin/libsvm.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines: And other kernel-based learning methods. Cambridge, UK: Cambridge University Press.
Frohlich, H., Chapelle, O., et al. (2003). Feature selection for support vector machines by means of genetic algorithms. In Proceedings of the 15th IEEE international conference on tools with artificial intelligence, Sacramento, CA, USA, pp. 142–148.
Goodman, D., Boggess, L., et al. (2002). Artificial immune system classification of multiple-class problems. Intelligent Engineering Systems Through Artificial Neural Networks, Fuzzy Logic, Evolutionary Programming Complex Systems and Artificial Life, 12, 179–184.
Hassanien, A. E. (2004). Rough set approach for attribute reduction and rule generation: A case of patients with suspected breast cancer. Journal of the American Society for Information Science and Technology, 55(11), 954–962.
Hsu, C. W., Chang, C. C., et al. (2003). A practical guide to support vector classification. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei. Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
Joachims, T., Nedellec, C., et al. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European conference on machine learning, pp. 137–142.
John, G. H., Kohavi, R., et al. (1994). Irrelevant features and the subset selection problem. In Proceedings of the 11th international conference on machine learning.
Karabatak, M., & Ince, M. C. (2009). An expert system for detection of breast cancer based on association rules and neural network. Expert Systems with Applications, 36(2, Part 2), 3465–3469.
Kryszkiewicz, M., & Rybinski, H. (1996). Attribute reduction versus property reduction. In Proceedings of the fourth European congress on intelligent techniques and soft computing, pp. 204–208.
Maglogiannis, I., Zafiropoulos, E., et al. (2009). An intelligent system for automated breast cancer diagnosis and prognosis using SVM based classifiers. Applied Intelligence, 30(1), 24–36.
Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: Application to face detection. In Proceedings of computer vision and pattern recognition, Puerto Rico, pp. 130–136.
Pawlak, Z. (1982). Rough sets. International Journal of Parallel Programming, 11(5), 341–356.
Pawlak, Z. (1996). Why rough sets. In Proceedings of the IEEE international conference on fuzzy systems.
Pawlak, Z. (1997). Rough set approach to knowledge-based decision support. European Journal of Operational Research, 99(1), 48–57.
Pena-Reyes, C. A., & Sipper, M. (1999). A fuzzy-genetic approach to breast cancer diagnosis. Artificial Intelligence in Medicine, 17(2), 131–155.
Polat, K., & Gunes, S. (2007). Breast cancer diagnosis using least square support vector machine. Digital Signal Processing, 17(4), 694–701.
Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90.
Sahan, S., Polat, K., et al. (2007). A new hybrid method based on fuzzy-artificial immune system and k-nn algorithm for breast cancer diagnosis. Computers in Biology and Medicine, 37(3), 415–423.
Scholkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. The MIT Press.
Setiono, R. (2000). Generating concise and accurate classification rules for breast cancer diagnosis. Artificial Intelligence in Medicine, 18(3), 205–219.
Ubeyli, E. D. (2007). Implementing automated diagnostic systems for breast cancer detection. Expert Systems with Applications, 33(4), 1054–1062.
UCI Repository of Machine Learning Databases. www.archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
