
International Journal of Applied Research and Studies (iJARS) ISSN: 2278-9480 Volume 2, Issue 5 (May - 2013) www.ijars.in

Research Article

A Study on Various Clustering Techniques for DNA Micro Arrays Based Gene Expression Data
Authors: 1 K. Sathishkumar*, 2 Dr. V. Thiagarasu, 3 M. Ramalingam

Address for correspondence:
1, 3 Assistant Professor of Computer Science, Gobi Arts & Science College, Gobichettipalayam, India
2 Associate Professor of Computer Science, Gobi Arts & Science College, Gobichettipalayam, India
*Corresponding Author Email-Id: sathishmsc.vlp@gmail.com

Abstract
Recent advances in DNA microarray technology help in obtaining gene expression profiles of tissue samples at fairly low cost. The amount of biological data, such as DNA sequences and microarray data, has increased tremendously. DNA microarrays have emerged as the leading technology for measuring gene expression levels, primarily because of their high throughput. Cluster analysis of gene expression data has proved to be a useful tool for identifying co-expressed genes. Information retrieval and data mining are powerful tools for extracting information from databases and information repositories. The integrative cluster analysis of both clinical and gene expression data has been shown to be an effective alternative for overcoming problems such as low clustering accuracy and long clustering time. Quite a few clustering approaches have been proposed for gene expression data. This work presents a brief survey and a comparative study of these approaches. Finally, a suitable clustering technique is suggested.

Keywords- Clustering, Data Mining, Gene Expression Data, Microarray

I. INTRODUCTION
Cluster analysis of gene expression data has proved to be a useful tool for identifying co-expressed genes. DNA microarrays have emerged as the leading technology for measuring gene expression levels, primarily because of their high throughput. Results from these experiments are usually presented in the form of a data matrix in which rows represent genes and columns represent conditions [5]. Each entry in the matrix is a measure of the expression level of a particular gene under a specific condition. Analysis of these data sets reveals genes of unknown function and enables the discovery of functional relationships between genes [21].

A deoxyribonucleic acid (DNA) microarray is a collection of microscopic DNA spots attached to a solid surface, such as a glass, plastic or silicon chip, forming an array. DNA microarray technologies are an essential part of modern biomedical research. This technology aims at the measurement of mRNA levels in particular cells or tissues for many genes at once. To this end, single strands of complementary DNA for the genes of interest are immobilized on spots arranged in a grid on a support, which will typically be a glass slide, a quartz wafer, or a nylon membrane. Measuring the quantity of label on each spot then yields an intensity value that should be correlated to the abundance of the corresponding RNA transcript in the sample [24].

Typically, microarray experiments produce datasets with expression values for thousands of genes but no more than a few dozen samples, so accurate classification of tissue samples in such high-dimensional problems is a difficult task [10]. Also, there is high redundancy in microarray data, and many genes carry information that is irrelevant for accurate clustering of diseases or phenotypes [15].

Clustering of genes into groups sharing common characteristics is a useful exploratory technique for a number of subsequent computational analyses. A wide range of clustering algorithms have been proposed specifically to analyze gene expression data, but most of them consider genes as independent entities or include relevant information on gene interactions in a suboptimal way. There is therefore still room for improvement in the performance of clustering algorithms, and a robust clustering method is indispensable for retrieving gene information from microarray experimental data.

Manuscript Id: iJARS/492


This paper analyzes various available clustering techniques to determine the most suitable and effective clustering approaches for gene expression data.

II. LITERATURE SURVEY
A number of clustering algorithms have been developed to improve on preceding clustering algorithms, to resolve their problems, and to fit specific fields [8]. There is no clustering method that can be universally used to solve all problems. So, in order to select or design a suitable clustering strategy, it is vital to investigate the features of the problem. As Xu and Wunsch indicated, clustering algorithm selection is combined with the selection of a corresponding proximity measure and the construction of a criterion function [25]. Patterns are grouped according to whether they resemble each other. Once a proximity measure is selected, the construction of a clustering criterion function makes the partition of clusters an optimization problem.

K-means is a partition-based clustering technique widely used in clustering gene expression data [14]. K-means is well known for its simplicity and speed, and it performs quite well on large datasets. However, it may not provide an identical result with each run of the algorithm. K-means is also sensitive to outliers, and its performance is not satisfactory in detecting clusters of arbitrary shape.

The main objective in cluster analysis is to group objects that are similar in one cluster and to separate objects that are dissimilar by assigning them to different clusters. One of the most popular clustering methods is the K-Means clustering algorithm. It assigns objects to a pre-defined number of clusters, which is given by the user (assume K clusters). The idea is to choose random cluster centres, one for each cluster. These centres should preferably be as far as possible from each other. In this algorithm, Euclidean distance is mostly used to find the distance between data points and centroids [7].
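The K-Means procedure described here can be illustrated with a short sketch. It is a minimal pure-Python version of the standard assign-then-recompute iteration, not the implementation of any cited paper. The function names (`euclidean`, `k_means`), the `max_iter` and `seed` parameters, and the toy data are assumptions introduced for illustration only.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Arbitrarily pick k distinct data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # convergence: assignments no longer change
            break
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical toy "expression profiles": two well-separated groups in 2-D.
data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3), (8.2, 7.8)]
centroids, labels = k_means(data, k=2)
```

On such well-separated toy data the first three points end up in one cluster and the last three in the other. Which cluster receives index 0 depends on the random initial centroids, which reflects the run-to-run variability noted above.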
The Euclidean distance between two multi-dimensional data points $X = (x_1, x_2, \ldots, x_m)$ and $Y = (y_1, y_2, \ldots, y_m)$ is described as follows:

$$d(X, Y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

The K-Means method aims to minimize the sum of squared distances between all points and their cluster centres. The procedure consists of the following steps.

Algorithm 1: K-Means clustering algorithm [12]
Require: D = {d1, d2, d3, ..., dn} // set of n data points
         K // number of desired clusters
Ensure: A set of K clusters.
Steps:
1. Arbitrarily choose K data points from D as initial centroids;
2. Repeat
   Assign each point di to the cluster which has the closest centroid;
   Calculate the new mean for each cluster;
   Until the convergence criterion is met.

Though the K-Means algorithm is simple, the quality of the final clustering is a drawback, since it depends heavily on the arbitrary selection of the initial centroids.

A Self-Organizing Map (SOM) is more robust than K-means for clustering noisy data [22], although noisy data can still reduce its accuracy. The input required is the number of clusters and the grid layout of the neuron map. Identifying the number of clusters in advance is difficult for gene expression data. Furthermore, partitioning approaches are restricted to data of lower dimensionality with intrinsically well-separated clusters of high density. Thus partitioning approaches do not perform well on high-dimensional gene expression data sets with intersecting and embedded clusters. A hierarchical structure can also be built on SOM, based on the Self-Organizing Tree Algorithm (SOTA) [6]. Fuzzy Adaptive Resonance Theory (Fuzzy ART) [23] is another form of SOM, which measures the coherence of a neuron (e.g., vigilance criterion). The output map is adjusted by splitting existing neurons or adding new neurons to the map, until the coherence of each neuron in the map satisfies a user-specified threshold.

In many applications, the expert interpretation of co-clustering is easier than for mono-dimensional clustering. Co-clustering aims at computing a bi-partition, that is, a collection of co-clusters: each co-cluster is a group of objects associated with a group of attributes, and these associations can support interpretations. Many constrained clustering algorithms have been proposed to exploit domain knowledge and to improve partition relevancy in the mono-dimensional case (e.g., using the so-called must-link and cannot-link constraints).
Co-clustering has been considered not only for extended must-link and cannot-link constraints (i.e., both objects and attributes can be involved), but also for interval constraints that enforce properties of co-clusters when considering ordered domains. An iterative co-clustering algorithm has been proposed which exploits user-defined constraints while minimizing the sum-squared residues [19].

Brazma et al. [3] and Ball et al. [2] discussed the importance of establishing a standard for recording and reporting microarray-based gene expression data and proposed the Minimum Information About a Microarray Experiment (MIAME), which describes the minimum information required to ensure that microarray data can be easily interpreted and that results derived from its analysis can be independently verified.

Kuo et al. [11] compared two high-throughput microarray technologies, Stanford-type (i.e., spotted) cDNA microarrays and Affymetrix oligonucleotide microarrays, and showed that corresponding mRNA measurements from the two platforms were poorly correlated. Further, their results suggest gene-specific, or more precisely probe-specific, factors influencing measurements differently in the two platforms, implying a poor prognosis for broad utilization of gene expression measurements across platforms.

Nimgaonkar et al. [17] studied the reproducibility of gene expression levels across two generations of Affymetrix GeneChips and concluded that although experimental replicates are highly reproducible, the reproducibility across generations depends on the degree of similarity of the probe sets and the expression level of the corresponding transcript.

Vignes and Forbes proposed a probabilistic model that has the advantage of taking into account individual data (e.g., expression) and pairwise data (e.g., interaction information from biological networks) simultaneously. The model is based on hidden Markov random field models in which parametric probability distributions account for the distribution of individual data. Data on pairs, possibly reflecting distance or similarity measures between genes, are then included through a graph, where the nodes represent the genes and the edges are weighted according to the available interaction information. As a probabilistic model, this model has many interesting theoretical features. In addition, preliminary experiments on simulated and real data show promising results and point out the gain in using such an approach [13].

Alberto Cozzini et al. proposed a penalized mixture of Student's t distributions for model-based clustering and gene ranking [1]. Together with a bootstrap procedure, the proposed approach provides a means of ranking genes according to their contributions to the clustering process. Experimental results show that the algorithm performs well compared with traditional Gaussian mixtures in the presence of outliers and longer-tailed distributions. The algorithm also identifies the true informative genes with high sensitivity and achieves improved model selection. An illustrative application to breast cancer data is also presented, which confirms established tumor subclasses.
Jiabin Deng et al. proposed an improved fuzzy clustering-text clustering method based on the fuzzy C-Means clustering algorithm and the edit distance algorithm [9]. The authors used feature evaluation to reduce the dimensionality of the high-dimensional text vector. Because the clustering results of the traditional fuzzy C-Means clustering algorithm lack stability, the authors introduced the high-power sample point set, the field radius and the weight. To handle the attribution of boundary values in the traditional fuzzy C-Means clustering algorithm, the authors recommended the edit distance algorithm.

Celikyilmaz et al. proposed a new fuzzy system modeling (FSM) approach based on improved fuzzy functions to model systems with a continuous output variable [4]. The new modeling approach introduces three features: i) an Improved Fuzzy Clustering (IFC) algorithm, ii) a new structure identification algorithm, and iii) a nonparametric inference engine. The IFC algorithm yields simultaneous estimates of the parameters of c-regression models, together with a fuzzy c-partitioning of the data, to calculate improved membership values with a new membership function. The structure identification of the new approach utilizes IFC, instead of the standard fuzzy C-Means clustering algorithm, to fuzzy-partition the data, and it uses the improved membership values as additional input variables along with the original scalar input variables for two different choices of regression method, least squares estimation or support vector regression, to determine "fuzzy functions" for each cluster. With the novel IFC, one could learn the system behavior more accurately than with other FSM models. The nonparametric inference engine is a new approach, which uses a method similar to k-nearest neighbor for reasoning.

Pal et al. proposed the Fuzzy-Possibilistic C-Means (FPCM) technique and algorithm, which generates both membership and typicality values when clustering unlabeled data [18]. FPCM constrains the typicality values so that the sum over all data points of typicalities to a cluster is one. For large data sets, this row sum constraint produces unrealistic typicality values. To address this, a new model called Possibilistic-Fuzzy C-Means (PFCM) is presented. PFCM produces memberships and possibilities concurrently, along with the usual point prototypes or cluster centers for each cluster. PFCM is a hybridization of FCM and Possibilistic C-Means (PCM) that often avoids various problems of PCM, FCM and FPCM. PFCM resolves the noise sensitivity defect of FCM, overcomes the coincident clusters problem of PCM, and eliminates the row sum constraints of FPCM. The first-order necessary conditions for extrema of the PFCM objective function are derived and used as the basis for a standard alternating optimization approach to find local minima of the PFCM objective function.
PFCM prototypes are less sensitive to outliers and can avoid coincident clusters; PFCM is a strong candidate for fuzzy rule-based system identification.

The Expectation-Maximization (EM) algorithm is very useful in statistical modelling. The most common algorithm uses an iterative refinement technique; it is also referred to as Lloyd's algorithm, particularly in the computer science community. Given an initial set of c means, the algorithm proceeds by alternating between two steps [16]. FPCM constructs memberships and possibilities simultaneously, along with the usual point prototypes or cluster centers for each cluster. FPCM using the EM algorithm is a hybridization of PCM and FCM that often avoids various problems of PCM, FCM and FPCM. It addresses the noise sensitivity defect of FCM and overcomes the coincident clusters problem of PCM. However, the estimation of centroids is still influenced by noisy data. Hence the Fuzzy Possibilistic C-Means Algorithm using the EM Algorithm (EMFPCM) [20] has been proposed.
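To make the notion of fuzzy membership shared by FCM and its variants more concrete, the following is a minimal sketch of the two alternating updates of standard fuzzy C-Means only. It does not implement PCM, FPCM, PFCM or EMFPCM, which add typicality terms not shown here. The function names, the fuzzifier value m = 2, the initial centres and the toy data are all assumptions chosen for illustration.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def fcm_memberships(points, centers, m=2.0):
    """Return u[k][i]: degree to which point k belongs to cluster i (rows sum to 1)."""
    power = 2.0 / (m - 1.0)
    u = []
    for p in points:
        dists = [euclidean(p, c) for c in centers]
        if 0.0 in dists:
            # A point lying exactly on a centre gets its membership from that centre alone.
            row = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(row)
            row = [r / total for r in row]
        else:
            # Standard FCM rule: u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1)).
            row = [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]
        u.append(row)
    return u


def fcm_centers(points, u, m=2.0):
    """Recompute each centre as the mean of all points weighted by u^m."""
    dim = len(points[0])
    centers = []
    for i in range(len(u[0])):
        weights = [row[i] ** m for row in u]
        total = sum(weights)
        centers.append([
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ])
    return centers


# Hypothetical toy data and initial centres, for illustration only.
data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3), (8.2, 7.8)]
centers = [[0.0, 0.0], [10.0, 10.0]]
for _ in range(20):  # alternate the two updates for a fixed number of rounds
    u = fcm_memberships(data, centers)
    centers = fcm_centers(data, u)
```

Unlike the hard labels of K-Means, every point here keeps a graded membership in every cluster. Because each row of memberships must sum to one, an outlier far from all centres still receives substantial membership, which is the noise sensitivity that the possibilistic variants discussed above aim to reduce.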


S. No | Authors | Techniques
1. | Doulaye Dembele and Philippe Kastner | Fuzzy partitioning method, Fuzzy C-means (FCM)
2. | Madhu Yedla, Srinivasa Rao Pathakota, Srinivasa T M | Method for finding better initial centroids
3. | Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander E, and Golub T | The application of self-organizing maps
4. | Dopazo J and Carazo J | A new type of unsupervised, growing, self-organizing neural network
5. | Tomida S, Hanai T, Honda H, and Kobayashi T | Clustering method based on adaptive theory
6. | Ruggero G. Pensa and Jean-François Boulicaut | An iterative co-clustering algorithm
7. | Alberto Cozzini, Ajay Jasra and Giovanni Montana | Model-based clustering and gene ranking
8. | Jiabin Deng, JuanLi Hu, Hehua Chi and Juebo Wu | An improved fuzzy clustering-text clustering method
9. | Celikyilmaz A and Burhan Turksen I | An improved fuzzy clustering (IFC) algorithm, a new structure identification algorithm, and a nonparametric inference engine

III. CONCLUSION
Clustering is a method of grouping similar data, and it is a very useful method applied in various applications. As an essential tool for data exploration, cluster analysis investigates unlabeled data, either by building a hierarchical structure or by forming a set of groups based on a prespecified number. This paper focuses on clustering algorithms and reviews a wide variety of approaches available in the literature. These algorithms were developed to solve different problems and have their own pros and cons. Though a number of clustering techniques have already been developed, several problems remain, such as low clustering accuracy and high clustering time. For each technique used for gene expression data, a detailed explanation can be given. The Fuzzy Possibilistic C-Means Algorithm using the EM Algorithm is a recently developed technique which can be more accurate and is reported to overcome the problems of the other existing techniques. Hence, thorough research in FCM using EM would help improve the overall clustering performance.

REFERENCES
[1] Alberto Cozzini, Ajay Jasra and Giovanni Montana, Robust model-based clustering with gene ranking, arXiv:1201.5687v1 [stat.ME], 27 Jan 2012.
[2] Ball CA, Sherlock G, Parkinson H, et al. Standards for microarray data. Science 2002; 298: 539.
[3] Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 2001; 29: 365-71.
[4] Celikyilmaz A and Burhan Turksen I, Enhanced Fuzzy System Models With Improved Fuzzy Clustering Algorithm, IEEE Transactions on Fuzzy Systems, Vol. 16, No. 3, pp. 779-794, 2008.
[5] De K.R and Bhattacharya A, Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes: detecting varying patterns in expression profiles, Bioinformatics, Vol. 24, pp. 1359-1366, 2008.
[6] Dopazo J and Carazo J, Phylogenetic reconstruction using an unsupervised neural network that adopts the topology of a phylogenetic tree, J Mol Evol, vol. 44, pp. 226-233, 1997.
[7] Doulaye Dembele and Philippe Kastner, Fuzzy C-means method for clustering microarray data, Bioinformatics, vol. 19, no. 8, pp. 973-980, 2003.
[8] Fasheng Liu and Lu Xiong, Survey on text clustering algorithm - Research present situation of text clustering algorithm, IEEE 2nd International Conference on Software Engineering and Service Science (ICSESS), 2011.
[9] Jiabin Deng, JuanLi Hu, Hehua Chi and Juebo Wu, An Improved Fuzzy Clustering Method for Text Mining, Second International Conference on Networks Security Wireless Communications and Trusted Computing (NSWCTC), Vol. 1, pp. 65-69, 2010.
[10] Jian J. Dai, Linh Lieu, and David Rocke, Dimension reduction for classification with gene expression microarray data, Statistical Applications in Genetics and Molecular Biology, Vol. 5, No. 1, pp. 1-21, 2006.
[11] Kuo WP, Jenssen TK, Butte AJ, Ohno-Machado L, Kohane IS. Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 2002; 18: 405-12.
[12] Madhu Yedla, Srinivasa Rao Pathakota, Srinivasa T M, Enhancing K-Means Clustering Algorithm with Improved Initial Center, International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 1 (2), pp. 121-125, 2010.
[13] Matthieu Vignes and Florence Forbes, Gene Clustering via Integrated Markov Models Combining Individual and Pairwise Features, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 6, No. 2, April-June 2009.
[14] MacQueen J, Some methods for classification and analysis of multivariate observations, in Proceedings of the Fifth Berkeley Symp. Math. Statistics and Probability, vol. 1, pp. 281-297, 1967.
[15] Napoleon D, Pavalakodi S, A New Method for Dimensionality Reduction using K-Means Clustering Algorithm for High Dimensional Data Set, International Journal of Computer Applications, Volume 13, No. 7, pp. 41-46, January 2011.
[16] Neal, Radford; Hinton, Geoffrey, A view of the EM algorithm that justifies incremental, sparse, and other variants, in Michael I. Jordan (Ed.), Learning in Graphical Models, Cambridge, MA: MIT Press, pp. 355-368, 1999.
[17] Nimgaonkar A, Sanoudou D, Butte AJ, Haslett N, Kunkel M, Beggs H and Kohane S. Reproducibility of gene expression across generations of Affymetrix microarrays. BMC Bioinformatics 2003; 4: 27.
[18] Pal N.R, Pal K, Keller J.M. and Bezdek J.C, A Possibilistic Fuzzy c-Means Clustering Algorithm, IEEE Transactions on Fuzzy Systems, Vol. 13, No. 4, pp. 517-530, 2005.
[19] Ruggero G. Pensa and Jean-François Boulicaut, Constrained Co-clustering of Gene Expression Data, SIAM International Conference on Data Mining - SDM, pp. 25-36, 2008.
[20] Shanthi R and Suganya R, Enhancement of Fuzzy Possibilistic C-Means Algorithm using EM Algorithm (EMFPCM), International Journal of Computer Applications (0975-8887), Volume 61, No. 12, January 2013.
[21] Schena M. Microarray biochip technology. Sunnyvale, CA: Eaton Publishing; 2000.
[22] Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander E, and Golub T, Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation, Proceedings of the National Academy of Sciences USA, vol. 96(6), pp. 2907-2912, 1999.
[23] Tomida S, Hanai T, Honda H, and Kobayashi T, Analysis of expression profile using fuzzy adaptive resonance theory, Bioinformatics, vol. 18(8), pp. 1073-83, 2002.
[24] Wolfgang Huber, Anja von Heydebreck and Martin Vingron, Analysis of microarray gene expression data, Max-Planck-Institute for Molecular Genetics, 14195 Berlin, April 2, 2003.
[25] Xu R and Wunsch D, Survey of clustering algorithms, IEEE Trans. on Neural Networks, Vol. 16, no. 3, pp. 645-678, 2005.

