
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 2, FEBRUARY 2011, ISSN 2151-9617

HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG

Acute Leukemia Cancer Classification using Single Genes

B.B.M. Krishna Kanth, Dr. U.V. Kulkarni, and Dr. B.G.V. Giridhar

Abstract—Gene expression profiling provides tremendous information to resolve the complexity of cancer. Selecting the most informative single genes from microarray datasets for cancer classification has become an important issue in recent times, along with predicting the classification accuracy of such identified genes using various classifiers. We propose a new classification system, the fuzzy hypersphere clustering neural network (FHCNN), which combines clustering and classification in order to differentiate cancer tissues such as acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Experimental results show that our FHCNN model using one outstanding gene, Zyxin, achieves the best classification accuracy of 94.12%, whereas other state-of-the-art methods reach at best 91.18%. Moreover, FHCNN is more stable and requires fewer parameter adjustments than the other classification methods.

Index Terms—Spearman correlation coefficient, fuzzy hypersphere, neural network, classification.

——————————  ——————————

1 INTRODUCTION

Classification of patient samples presented as gene expression profiles has become the subject of extensive study in biomedical research in recent years. One of the most common approaches is binary classification, which distinguishes between two types of samples: positive, or case, samples (taken from individuals that carry some illness) and negative, or control, samples (taken from healthy individuals). Supervised learning offers an effective means to differentiate positive from negative samples: a collection of samples with known type labels is used to train a classifier that is then used to classify new samples. Microarrays [1, 2] allow simultaneous measurement of tens of thousands of gene expression levels per sample. This technology has changed biomedical research in a profound way and has rapidly emerged as a major tool for obtaining gene expression profiles of human cancers [3, 4]. Since the development of microarray technology, many data mining approaches [18, 19, 20, 21, 22, 24] have been developed to analyze microarray data. Because typical microarray studies usually contain fewer than one hundred samples, the number of features (genes) in the data far exceeds the number of samples. This asymmetry of the data poses a serious challenge for standard learning algorithms that can be overcome by selecting a subset of the features and using only them in the classification. Generally, microarray data analysis includes two key procedures: gene selection and classifier construction. From biological and clinical points of view, finding a small number of important genes can help researchers concentrate on these genes and investigate the mechanisms of cancer development and treatment. It may bring down the cost of laboratory tests, because a patient needs to be tested on only a few genes rather than thousands. Furthermore, it may be possible to obtain simple rules that let doctors make a diagnosis without even using a classifier or a computer.

If we survey the established reports in this field, we find that almost all the accurate classification results are obtained using more than a single gene. Recently, Wang, X. et al. [9] proposed a rough-set-based soft computing method to conduct cancer classification using single genes. Multi-gene models suffer from the disadvantage that it is not easy to assess which gene is more important in the model, because they operate on a group of genes. As a result, the significant biomarkers of the related cancers are hard to detect. In addition, multi-gene models tend to be difficult to interpret. In this article, we explore the classification of cancer on the basis of single genes with the leukemia dataset using our proposed FHCNN model. We want to underscore that sufficiently accurate classification can be achieved, and important biomarkers can be found easily and efficiently, by using single-gene models.
2 METHODS

————————————————
• B.B.M. Krishna Kanth is with the S.G.G.S.I.E.T, Nanded, 431606, Maharashtra, India.
• Dr. U.V. Kulkarni is with the S.G.G.S.I.E.T, Nanded, 431606, Maharashtra, India.
• Dr. B.G.V. Giridhar is with the Andhra Medical College, Visakhapatnam, 530002, Andhra Pradesh, India.

2.1 Leukemia Data Set
The leukemia dataset contains 38 training samples, of which 27 belong to the ALL class and 11 to the AML class, and 34 testing samples, of which 20 belong to ALL and the remaining 14 to AML. The data set is available at http://www-genome.wi.mit.edu/cancer/.

2.2 Gene Selection
In order to score the similarity of each gene, an ideal feature vector [5] is defined. It is a vector consisting of 0's for one class (ALL) and 1's for the other class (AML):

ideal_i = (0, 0, 0, ..., 0, 1, ..., 1, 1, 1)    (1)

The similarity of g_i and ideal_i using the Spearman correlation coefficient (SCC) [6] is defined as

SCC = 1 − 6 Σ_{i=1..n} (ideal_i − g_i)² / ( n (n² − 1) )    (2)

where n is the number of samples, g_i is the ith real value of the gene vector, and ideal_i is the corresponding ith binary value of the ideal feature vector.

3 FUZZY HYPERSPHERE CLUSTERING NEURAL NETWORK

The FHCNN consists of two layers, as shown in Fig. 1. The FR layer accepts an input pattern and consists of n processing elements, one for each dimension of the pattern. The FC layer consists of q processing nodes that are constructed during training; each node represents a fuzzy set hypersphere (HS), which is characterized by its membership function. The processing performed by an HS node is shown in Fig. 2. The weights between the FR and FC layers represent the centre points (CPs) of the HSs. As shown in Fig. 2, C_j = (c_j1, c_j2, c_j3, ..., c_jn) represents the CP of the HS m_j.

The threshold input of an HS, denoted by T, is set to one and is weighted by ζ, which represents the radius of the HS created during the training process. The CPs of the HSs are stored in the matrix C. The radii of the HSs created during training are bounded in the range 0 ≤ ζ ≤ 1. The maximum size of a hypersphere is bounded by a user-defined value λ, where 0 ≤ λ ≤ 1; λ is called the growth parameter, and it puts an upper limit on the radius of the hypersphere.

Let the training set be R = {R_h | h = 1, 2, ..., P}, where R_h = (r_h1, r_h2, r_h3, ..., r_hn) ∈ Iⁿ is the hth pattern. The membership function of the hypersphere node m_j is defined as

m_j(R_h, C_j, ζ) = 1 − f(l, ζ, γ)    (3)

where f(·) is a three-parameter ramp threshold function defined as

f(l, ζ, γ) = 0,          if 0 ≤ l ≤ ζ
           = (l − ζ)γ,   if ζ ≤ l ≤ 1
           = 1,          if l > 1    (4)

and the argument l is defined as

l = ( Σ_{i=1..n} (c_ji − r_hi)² )^(1/2)    (5)

Fig. 1. Fuzzy Hypersphere Clustering Neural Network. [Figure: the FR input layer nodes r_1, ..., r_n are fully connected to the FC-layer nodes m_1, ..., m_q through the weight matrix C and radius ζ.]

Fig. 2. Implementation of the Fuzzy Hypersphere Clustering Neural Network. [Figure: an HS node m_j computes f(R_h, C_j, ζ) from the centre-point weights c_j1, ..., c_jn and the threshold input T = 1 weighted by ζ.]
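The two pieces of machinery above, the SCC gene score of equation (2) and the hypersphere membership function of equations (3)-(5), can be sketched in Python as follows. This is a minimal illustration, not the authors' code; the function and variable names are ours.

```python
import math

def scc_score(gene, ideal):
    """SCC similarity between a gene's expression profile and the ideal
    feature vector, per equation (2):
    SCC = 1 - 6 * sum((ideal_i - g_i)^2) / (n * (n^2 - 1))."""
    n = len(gene)
    d2 = sum((a - b) ** 2 for a, b in zip(ideal, gene))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

def membership(pattern, centre, zeta, gamma=1.0):
    """Fuzzy hypersphere membership m_j of equations (3)-(5)."""
    # Equation (5): Euclidean distance l between pattern and centre point.
    l = math.sqrt(sum((c - r) ** 2 for c, r in zip(centre, pattern)))
    # Equation (4): three-parameter ramp threshold f(l, zeta, gamma).
    if l <= zeta:
        f = 0.0
    elif l <= 1.0:
        f = (l - zeta) * gamma
    else:
        f = 1.0
    # Equation (3): membership is 1 - f.
    return 1.0 - f
```

With centre point [0.5, 0.5], ζ = 0.3 and γ = 1, a pattern inside the hypersphere gets membership 1, and a pattern at distance 0.4 gets membership 1 − (0.4 − 0.3) = 0.9, consistent with the steady decay shown in Fig. 3.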

The membership function returns m_j = 1 if the input pattern R_h is contained by the hypersphere. The parameter γ, 0 ≤ γ ≤ 1, is a sensitivity parameter which governs how fast the membership value decreases outside the hypersphere as the distance between R_h and C_j increases. A sample plot of the membership function with centre point [0.5 0.5] and radius 0.3 is shown in Fig. 3; the membership values decrease steadily with increasing distance from the hypersphere. Each node of the FC layer represents a cluster. The output of the jth FC node represents the fuzzy degree to which the input pattern belongs to the cluster m_j.

Fig. 3. Plot of the fuzzy hypersphere membership function with centre point CP [0.5 0.5], radius ζ = 0.3 and sensitivity parameter γ = 1.

3.1 The FHCNN Learning Algorithm
The HSs created in the FC layer represent clusters. The radius ζ of the created HSs is user defined and bounded by 0 ≤ ζ ≤ 1. The number of clusters or hyperspheres constructed depends on the parameter ζ. The value of ζ is problem dependent and should be moderately low, so that the HSs created include only patterns which are close to each other and likely fall in the same cluster. The FHCNN learning algorithm consists of two steps for creating the HSs during training: (1) finding hypersphere centre points, and (2) removing the patterns grouped in a hypersphere.

3.2 Finding Hypersphere Centre Points
To determine the centre points of the clusters, every pattern is compared with every other pattern, and for each pattern the number of patterns within Euclidean distance ζ is counted. The pattern with the maximum count is selected as the centroid, or CP, of the hypersphere. The process of selecting the CP of a cluster is described below.

If ‖R_i − R_j‖ ≤ ζ then D_i = D_i + 1, for i = 1, 2, 3, ..., P and j = 1, 2, ..., P    (6)

where R_i and R_j are the ith and jth patterns in the dataset R, D is a P-dimensional vector, and D_i is the ith element of D, which contains the number of patterns falling around the ith pattern whose Euclidean distance is less than or equal to ζ. To find the pattern with the maximum count, equation (7) is used, in which D_max is the maximum value in the row vector D and D_ind is the index of the maximum value:

[D_max  D_ind] = max[D]    (7)

The pattern R_ind from the dataset R is the most appropriate and is chosen as the CP of the first hypersphere m_1. The hypersphere m_1 returns a fuzzy membership value equal to one for the patterns which fall around the selected centre point R_ind within distance ζ. Hence, these patterns are grouped into a cluster, and the pattern R_ind acts as the CP of the created cluster. The weight assigned to the synapses of the created hypersphere is given by equation (8):

C_1 = R_ind    (8)

3.3 Removal of Grouped Patterns in a Hypersphere
The patterns clustered in the previous step are eliminated, and the next pass uses the remaining unclustered patterns to create new hyperspheres. Let R_p, R_c and R_n represent the set of patterns used in the current pass, the set of patterns clustered in the current pass, and the set of patterns that will be used in the next pass, respectively. Then R_n can be described as

R_n = R_p − R_c = {R_n | R_n ∈ R_p and R_n ∉ R_c}    (9)

The R_n calculated in the current pass becomes R_p for the next pass. The above two steps are repeated until all the patterns are clustered. Each node of the FC layer constructed during training represents a cluster and gives a soft decision. The output of the kth FC node represents the degree to which the input pattern belongs to the cluster m_k.

4 RESULTS AND DISCUSSION

We evaluated the proposed approach on the leukemia dataset, which consists of 72 samples (38 training samples, 34 testing samples), each described by 7129 attributes (genes).
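The two-step learning loop of Sections 3.2 and 3.3 — count neighbours within ζ (eq. 6), take the densest pattern as a centre point (eqs. 7-8), drop the clustered patterns, and repeat on the remainder (eq. 9) — can be sketched in Python. This is a minimal sketch under our own naming, not the authors' implementation.

```python
import math

def euclid(a, b):
    """Euclidean distance between two patterns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fhcnn_train(patterns, zeta):
    """Cluster the training patterns into hypersphere centre points.

    Each pass counts, for every remaining pattern, how many patterns lie
    within radius zeta (eq. 6), picks the pattern with the maximum count
    as the new centre point (eqs. 7-8), and removes the patterns it
    covers before the next pass (eq. 9)."""
    centres = []
    remaining = list(patterns)
    while remaining:
        counts = [sum(1 for q in remaining if euclid(p, q) <= zeta)
                  for p in remaining]
        cp = remaining[counts.index(max(counts))]  # densest pattern
        centres.append(cp)
        # Keep only the patterns outside the new hypersphere.
        remaining = [p for p in remaining if euclid(cp, p) > zeta]
    return centres
```

Each returned centre point would become the weight vector of one FC-layer node; with the 38 Zyxin expression values and ζ = 0.2, a loop of this shape would play the role of the four-cluster training run reported below.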

The pathological classes (targets) to be predicted are ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia). As a preprocessing step, we ranked all 7129 genes using the Spearman correlation coefficient (SCC) scoring approach. We picked out the 10 genes with the largest SCC values from the training samples to do the classification; Table 1 shows these 10 genes. We input these genes one by one to the FHCNN according to their ranks. During all the experiments with the FHCNN, the parameter γ is set to 1, and ζ is adjusted to tune the performance, varying the number of created HSs, to obtain the maximum possible accuracy. When we trained the FHCNN with the 38 patterns of the gene Zyxin and ζ equal to 0.2, it created four clusters. After that, FHCNN performance is assessed on the independent 34 testing samples. This process is repeated for all the remaining selected genes. Among the top 10 genes, the four genes with Gene ids #4847, #1882, #1834 and #760 were among the biologically instructive genes identified earlier by many other approaches [7-12]. Moreover, when considering the performance of the selected genes and the FHCNN for each class separately, the five genes with Gene ids #760, #1834, #4373, #6855 and #3252 showed 100% classification accuracy for samples of the ALL class, and Gene id #4377 attained 100% classification accuracy for samples of the AML class.

TABLE 1
TOP 10 GENES WITH THE BEST CLASSIFICATION ACCURACY USING FHCNN AND SCC GENE SELECTION METHOD

Gene id   #Correctly classified samples (ALL/AML)   Classification accuracy % (ALL/AML)
4847      32 (19/13)                                94.12 (95/92.86)
1882      32 (19/13)                                94.12 (95/92.86)
760       32 (20/12)                                94.12 (100/85.71)
1834      32 (20/12)                                94.12 (100/85.71)
2402      29 (18/11)                                85.29 (90/78.57)
4373      31 (20/11)                                91.18 (100/78.57)
6855      30 (20/10)                                88.24 (100/71.43)
6041      31 (19/12)                                91.18 (95/85.71)
3252      31 (20/11)                                91.18 (100/78.57)
4377      30 (16/14)                                88.24 (80/100)

The leukemia dataset has been well studied by many researchers. The best classification accuracy results reported in our work and some others are shown in Table 2. Using single genes, our accuracy is the highest among all the methods, and the other methods must use far more genes to approach our classification accuracy.

TABLE 2
COMPARISON OF THE BEST CLASSIFICATION ACCURACY WITH THE LEUKEMIA DATA SET

Methods                  # Genes    #Correctly classified samples (Accuracy)
Proposed                 1          32 (94.1 %)
Wang, X. et al. [9]      1          32 (94.1 %)
Tong et al. [21]         2          31 (91.2 %)
Xu, R. et al. [24]       5          31 (91.2 %)
Sun et al. [12]          1          31 (91.2 %)
Banerjee et al. [13]     9          31 (91.2 %)
Li et al. [14]           1          31 (91.2 %)
Tan et al. [15]          1038       31 (91.2 %)
Wang, Y. et al. [11]     1          31 (91.2 %)
Cong, G. et al. [16]     10-40      31 (91.2 %)
Golub et al. [10]        50         29 (85.3 %)
Furey et al. [17]        25-1000    30-32 (88.2 %-94.1 %)

Using the one common gene Zyxin, all other previously published methods [9, 11, 12, 14, 20, 22] have so far reached a best classification accuracy of 91.18%, whereas our proposed method achieves 94.12%, as shown in Table 3.

TABLE 3
COMPARISON OF THE BEST CLASSIFICATION ACCURACY WITH THE LEUKEMIA DATA SET USING THE ONE OUTSTANDING GENE ZYXIN

Methods                      #Correctly classified samples (Accuracy %)
Proposed                     32 (94.1 %)
Kulkarni, U.V. et al. [20]   31 (91.2 %)
Wang, X. et al. [9]          31 (91.2 %)
Wang, Y. et al. [11]         31 (91.2 %)
Sun, L. et al. [12]          31 (91.2 %)
Li, J. et al. [14]           31 (91.2 %)
Li, W. et al. [22]           31 (91.2 %)
Frank et al. [23]            30-31 (88.24 %-91.18 %)

5 CONCLUSION

The proposed work focused on the classification of acute leukemia using only single genes, particularly the one outstanding top gene, Zyxin. A learning strategy which combines clustering and a fuzzy classifier with a fuzzy hypersphere membership function was used for predicting the class of cancer. Zyxin was selected as the top-ranked gene using the Spearman coefficient gene selection method, and using this single gene our method achieved a best classification accuracy of 94.12% (2 of the 34 test samples classified wrongly), whereas all others achieved only 91.18% (3 of the 34 test samples classified wrongly). Our future work will look at training new patterns adaptively, without retraining the already trained patterns, and at removing overlapping

of hyperspheres of different classes, so that it may help to increase the classification accuracy to a greater extent.
REFERENCES

[1] Schena, M., Shalon, D., Davis, R.W. and Brown, P.O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467-470.
[2] DeRisi, J., Penland, L., Brown, P.O., Bittner, M.L., Meltzer, P.S., Ray, M., Chen, Y., Su, Y.A. and Trent, J.M. (1996). Use of a cDNA microarray to analyze gene expression patterns in human cancer. Nature Genetics 14, 457-460.
[3] Shi, T., Liou, L.S., Sadhukhan, P., Duan, Z.H., Hissong, J., Almasan, A., Novick, A. and DiDonato, J.A. (2004). Effects of resveratrol on gene expression in renal cell carcinoma. Cancer Biology and Therapy 3, 882-888.
[4] Liou, L.S., Shi, T., Duan, Z.H., Sadhukhan, P., Der, S.D., Novick, A., Hissong, J., Almasan, A. and DiDonato, J.A. (2004). Microarray gene expression profiling and analysis in renal cell carcinoma. BMC Urology 4, 9.
[5] Chen, Y., and Zhao, Y. (2008). A novel ensemble of classifiers for microarray data classification. Applied Soft Computing 8(4):1664-1669.
[6] Devore, J.L. (1995). Probability and Statistics for Engineering and the Sciences. 4th edition. Duxbury Press, California.
[7] Li, D., and Zhang, W. (2006). Gene selection using rough set theory. In Proceedings of the 1st International Conference on Rough Sets and Knowledge Technology, 778-785.
[8] Momin, B.F., and Mitra, S. (2006). Reduct generation and classification of gene expression data. In Proceedings of the 1st International Conference on Hybrid Information Technology, 699-708.
[9] Wang, X., and Gotoh, O. (2009). Cancer classification using single genes. Genome Informatics 23(1):176-188.
[10] Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D. and Lander, E.S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531-537.
[11] Wang, Y., Tetko, I.V., Hall, M.A., Frank, E., Facius, A., Mayer, K.F. and Mewes, H. (2005). Gene selection from microarray data for cancer classification—a machine learning approach. Computational Biology and Chemistry 29(1):37-46.
[12] Sun, L., Miao, D., and Zhang, H. (2008). Efficient gene selection with rough sets from gene expression data. In Proceedings of the 3rd International Conference on Rough Sets and Knowledge Technology, 164-171.
[13] Banerjee, M., Mitra, S. and Banka, H. (2007). Evolutionary-rough feature selection in gene expression data. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 37, 622-632.
[14] Li, J., and Wong, L. (2002). Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns. Bioinformatics 18(5):725-734.
[15] Tan, A.C., and Gilbert, D. (2003). Ensemble machine learning on gene expression data for cancer classification. Applied Bioinformatics 2(3 Suppl):75-83.
[16] Cong, G., Tan, K.L., Tung, A. and Xu, X. (2005). Mining top-k covering rule groups for gene expression data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 670-681.
[17] Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M. and Haussler, D. (2000). Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16(10):906-914.
[18] Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C. and Meltzer, P.S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7, 673-679.
[19] Tan, A.H., and Pan, H. (2005). Predictive neural networks for gene expression data analysis. Neural Networks 18, 297-306.
[20] Kulkarni, U.V., and Sontakke, T.R. (2001). Fuzzy hypersphere neural network classifier. In 10th IEEE International Conference on Fuzzy Systems, 1559-1562.
[21] Tong, D.L., Phalp, K.T., Schierz, A.C. and Mintram, R. (2009). Innovative hybridization of genetic algorithms and neural networks in detecting marker genes for leukemia cancer. In 4th IAPR International Conference on Pattern Recognition in Bioinformatics, Sheffield, UK.
[22] Li, W., and Yang, Y. (2002). How many genes are needed for a discriminant microarray data analysis. Methods of Microarray Data Analysis. Kluwer Academic Publishers, pp. 137-150.
[23] Frank, E., Hall, M., Trigg, L., Holmes, G. and Witten, I.H. (2004). Data mining in bioinformatics using Weka. Bioinformatics 20, 2479-2481.
[24] Xu, R., Anagnostopoulos, G., and Wunsch, D. (2002). Tissue classification through analysis of gene expression data using a new family of ART architectures. In Proceedings of the International Joint Conference on Neural Networks 1, 300-304.

B.B.M. Krishna Kanth received the B.E. degree in Electronics and Communication Engineering from Andhra University in 1999 and the M.E. degree in Computer Technology from S.R.T.M. University, Nanded, in 2002. He is currently a Research Scholar in the Department of Computer Science and Engineering, SGGS Institute of Engineering and Technology, Nanded, Maharashtra. His current research interests include various aspects of neural networks and fuzzy logic, DNA analysis and bioinformatics.

Dr. U.V. Kulkarni received the Ph.D. degree in Electronics and Computer Science Engineering from S.R.T.M. University, Nanded, in 2003. He is currently Head of the Computer Science and Engineering Department and Dean of Academics, SGGS Institute of Engineering and Technology, Nanded, Maharashtra.

Dr. B.G.V. Giridhar received the Doctor of Medicine (D.M.) (post-doctorate) degree in Endocrinology and the Doctor of Medicine (M.D.) in General Medicine from Andhra Medical College, Visakhapatnam, in 2006 and 2000 respectively. He is currently an Assistant Professor at Andhra Medical College, Visakhapatnam.