
FGKA: a Fast Genetic K-means Clustering Algorithm.

Conference Paper, January 2004

DOI: 10.1145/967900.968029 (Source: DBLP)

POSTER ABSTRACT
FGKA: A Fast Genetic K-means Clustering Algorithm
Yi Lu, Shiyong Lu, Farshad Fotouhi
Department of Computer Science, Wayne State University
Detroit, MI 48202, USA
{luyi, shiyong, fotouhi}@cs.wayne.edu

Youping Deng, Susan J. Brown
Division of Biology, Kansas State University
Manhattan, KS 66506, USA
{ydeng, sjbrown}@ksu.edu

ABSTRACT
In this paper, we propose a new clustering algorithm called the Fast Genetic K-means Algorithm (FGKA). FGKA is inspired by the Genetic K-means Algorithm (GKA) proposed by Krishna and Murty in 1999 but features several improvements over GKA. Our experiments indicate that, while the K-means algorithm might converge to a local optimum, both FGKA and GKA always converge to the global optimum eventually, but FGKA runs much faster than GKA.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Data mining.

General Terms
Algorithms, Measurement, Performance, Experimentation.

Keywords
Clustering, Genetic algorithm, K-means algorithm

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC'04, March 14-17, 2004, Nicosia, Cyprus.
Copyright 2004 ACM 1-58113-812-1/03/04$5.00.

1. INTRODUCTION
In recent years, clustering algorithms have been effectively applied in molecular biology for gene expression data analysis. Clustering algorithms such as K-means, hierarchical clustering, and SOM partition genes into groups based on the similarity between their expression profiles. In this way, functionally related genes are identified. As the amount of laboratory data in molecular biology grows exponentially each year due to advanced technologies such as microarrays, new efficient and effective clustering methods must be developed to process this growing amount of biological data.

In this paper, we propose a new clustering algorithm called the Fast Genetic K-means Algorithm (FGKA). FGKA is inspired by the Genetic K-means Algorithm (GKA) but features several improvements over GKA. One salient feature of FGKA (as well as GKA) is that it will always converge to the global optimum, whereas the K-means algorithm might converge to a local optimum. To demonstrate the effectiveness and efficiency of FGKA, we performed a comparative study on gene expression data analysis using FGKA, GKA, and K-means. Our experiments indicate that both FGKA and GKA converge to better optima than the K-means algorithm, but FGKA runs much faster than GKA. We proposed FGKA in the context of clustering gene expression data. However, the algorithm itself is a generic clustering method that can be applied to other domains as well.

2. THE OBJECTIVE FUNCTION
The gene expression data for clustering consists of $N$ genes and their corresponding $N$ patterns. Each pattern is a vector of $D$ dimensions recording the expression levels of the gene under each of the $D$ monitored conditions or at each of the $D$ time points. The goal of the FGKA algorithm is to partition the $N$ patterns into a user-defined number $K$ of groups such that the partition minimizes the Total Within-Cluster Variation (TWCV, also called square-error in the literature), which is defined as follows.

Let $X_1, X_2, \ldots, X_N$ be the $N$ patterns, and let $X_{nd}$ denote the $d$th feature of pattern $X_n$ ($n = 1, \ldots, N$). Each partitioning is represented by a string, a sequence of numbers $a_1 \ldots a_N$, where $a_n$ takes a value from $\{1, 2, \ldots, K\}$ representing the number of the cluster that pattern $X_n$ belongs to. Let $G_k$ denote the $k$th cluster and $Z_k$ denote the number of patterns in $G_k$. The Total Within-Cluster Variation (TWCV) is defined as

$$TWCV = \sum_{n=1}^{N} \sum_{d=1}^{D} X_{nd}^2 - \sum_{k=1}^{K} \frac{1}{Z_k} \sum_{d=1}^{D} SF_{kd}^2$$

where $SF_{kd}$ is the sum of the $d$th features of all the patterns in $G_k$. Due to space limitations, we omit the derivation; readers are referred to [3] for details.

3. FAST GENETIC K-MEANS ALGORITHM
FGKA maintains a population (set) of $Z$ coded solutions (partitions), where $Z$ is a parameter specified by the user. Each solution is coded by a string $a_1 \ldots a_N$ of length $N$. Given a solution $S_z = a_1 \ldots a_N$, we define the legality ratio of $S_z$, $e(S_z)$, as the number of non-empty clusters in $S_z$ divided by $K$. $S_z$ is legal if $e(S_z) = 1$, and illegal otherwise.

FGKA starts with the initialization phase, which generates the initial population $P_0$. The population of the next generation, $P_{i+1}$, is obtained by applying the following genetic operators sequentially to the current population $P_i$: the selection operator, the mutation operator, and the K-means operator. The evolution continues until the termination condition is reached.

The initialization phase randomly generates the initial population $P_0$ of $Z$ solutions, which might include illegal strings. At first sight, illegal strings are undesirable. For this
reason, the GKA algorithm [2] makes significant effort to eliminate illegal strings. Illegal strings, however, are permitted in FGKA, but they are treated as the most undesirable solutions by defining their TWCVs as $+\infty$ and assigning them lower fitness values. Allowing illegal strings in the evolution process avoids the overhead of illegal string elimination, and thus improves the time performance of the algorithm. In the following, we give a brief description of the three genetic operators.

3.1 The Selection Operator
We use the so-called proportional selection for the selection operator, in which the population of the next generation is determined by $Z$ independent random experiments. Each experiment randomly selects a solution from the current population $\{S_1, S_2, \ldots, S_Z\}$ according to the probability distribution $\{p_1, p_2, \ldots, p_Z\}$ defined by

$$p_z = \frac{F(S_z)}{\sum_{z'=1}^{Z} F(S_{z'})} \quad (z = 1, \ldots, Z)$$

where $F(S_z)$ denotes the fitness value of solution $S_z$. In our context, the objective is to minimize the Total Within-Cluster Variation (TWCV). Therefore, solutions with smaller TWCVs should have higher probabilities of survival and should be assigned greater fitness values. In addition, illegal strings are less desirable and should have lower probabilities of survival, and thus should be assigned lower fitness values. We define $F(S_z)$ as follows:

$$F(S_z) = \begin{cases} 1.5 \cdot TWCV_{max} - TWCV(S_z), & \text{if } S_z \text{ is legal} \\ e(S_z) \cdot F_{min}, & \text{otherwise} \end{cases}$$

where $TWCV_{max}$ is the maximum TWCV that has been encountered up to the present generation, and $F_{min}$ is the smallest fitness value of the legal strings in the current population if any exist; otherwise $F_{min}$ is defined as 1.

3.2 The Mutation Operator
The mutation operator shakes the algorithm out of a local optimum and moves it towards the global optimum. During mutation, we replace $a_n$ by $a_n'$ for $n = 1, \ldots, N$ simultaneously. $a_n'$ is a cluster number randomly selected from $\{1, \ldots, K\}$ with the probability distribution $\{p_1, p_2, \ldots, p_K\}$ defined by

$$p_k = \frac{1.5 \cdot d_{max}(X_n) - d(X_n, c_k) + 0.5}{\sum_{k'=1}^{K} \left( 1.5 \cdot d_{max}(X_n) - d(X_n, c_{k'}) + 0.5 \right)}$$

where $d(X_n, c_k)$ is the Euclidean distance between pattern $X_n$ and the centroid $c_k$ of the $k$th cluster, and $d_{max}(X_n) = \max_k \{ d(X_n, c_k) \}$. If the $k$th cluster is empty, then $d(X_n, c_k)$ is defined as 0. The bias 0.5 is introduced to avoid a divide-by-zero error in the case that all patterns are equal and are assigned to the same cluster in the given solution.

First, the above mutation operator ensures that an arbitrary solution, including the global optimum, can be generated by mutation from the current solution with a positive probability. Second, it moves each $X_n$ towards a closer cluster with a higher probability. Third, it raises the probability of converting an illegal solution to a legal one.

3.3 The K-means Operator
In order to speed up the convergence process, we introduce one step of the classical K-means algorithm, which we call the K-means operator (KMO). Given a solution encoded by $a_1 \ldots a_N$, we replace $a_n$ by $a_n'$ for $n = 1, \ldots, N$ simultaneously, where $a_n'$ is the number of the cluster whose centroid is closest to $X_n$ in Euclidean distance.

To account for illegal strings, we define $d(X_n, c_k) = +\infty$ if the $k$th cluster is empty. This definition is different from the one in Section 3.2, in which we defined $d(X_n, c_k) = 0$ if the $k$th cluster is empty. The motivation for this new definition is that we want to avoid reassigning all patterns to empty clusters. Therefore, an illegal string will remain illegal after the application of KMO.

4. EXPERIMENTAL RESULTS
Our experiments were conducted on a Dell Dimension 8200 PC with a 2.4 GHz CPU and 512 MB of RAM. Three algorithms, the K-means algorithm, the original GKA algorithm, and our proposed FGKA algorithm, were implemented in C with Microsoft Visual Studio. Two data sets were used in our experiments: fig2data [1] and chodata [4].

The experiments show that, while the K-means algorithm might converge to a local optimum, both FGKA and GKA always converge to the global optimum, and FGKA runs almost 20 times faster than GKA. They also show that the three improvements (efficient calculation of TWCVs, avoiding the overhead of illegal string elimination, and the simplification of the mutation operator) contribute to the improvement over GKA to different degrees. More details are available in [3].

5. CONCLUSIONS
In this paper, we propose a new clustering algorithm called the Fast Genetic K-means Algorithm (FGKA). FGKA is inspired by the Genetic K-means Algorithm (GKA) [2] but features several improvements over it, including efficient calculation of TWCVs, avoiding the overhead of illegal string elimination, and the simplification of the mutation operator. The initialization phase and the three operators are redefined to achieve these improvements.

6. REFERENCES
[1] V. Iyer, M. Eisen, D. Ross, G. Schuler, T. Moore, J. Lee, J. Trent, L. Staudt, J. Hudson, M. Boguski, D. Lashkari, D. Shalon, D. Botstein, and P. Brown. The transcriptional program in the response of human fibroblasts to serum. Science, 283, January 1999.

[2] K. Krishna and M. Murty. Genetic K-means algorithm. IEEE Transactions on Systems, Man and Cybernetics - Part B: Cybernetics, 29(3):433-439, June 1999.

[3] Y. Lu, S. Lu, F. Fotouhi, Y. Deng, and S. Brown. Fast genetic K-means algorithm and its application in gene expression data analysis. Technical Report TR-DB-06-2003, http://www.cs.wayne.edu/~luyi/publication/tr0603.pdf, 2003.

[4] S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho, and G.M. Church. Systematic determination of genetic network architecture. Nature Genetics, 22:281-285, 1999.
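
As an illustration of the objective function of Section 2 and the fitness and selection rules of Section 3.1, the following is a minimal Python sketch, not the authors' C implementation. The function names (`twcv`, `legality_ratio`, `fitness_values`, `select`) and the toy data are hypothetical, and the sketch assumes that the caller tracks `twcv_max` and that it is positive.

```python
import math
import random

def twcv(X, a, K):
    """TWCV of the partition a (cluster numbers 1..K), using the SF-based formula."""
    D = len(X[0])
    Z = [0] * K                          # Z[k]: number of patterns in cluster k+1
    SF = [[0.0] * D for _ in range(K)]   # SF[k][d]: sum of feature d over cluster k+1
    for x, k in zip(X, a):
        Z[k - 1] += 1
        for d in range(D):
            SF[k - 1][d] += x[d]
    if 0 in Z:
        return math.inf                  # illegal string: TWCV is defined as +infinity
    total = sum(v * v for x in X for v in x)
    within = sum(sum(s * s for s in SF[k]) / Z[k] for k in range(K))
    return total - within

def legality_ratio(a, K):
    """e(S): number of non-empty clusters divided by K."""
    return len(set(a)) / K

def fitness_values(X, population, K, twcv_max):
    """F(S_z) for every solution; twcv_max is assumed to be tracked by the caller."""
    costs = [twcv(X, a, K) for a in population]
    legal = [1.5 * twcv_max - c for c in costs if c != math.inf]
    f_min = min(legal) if legal else 1.0   # F_min is 1 when no legal string exists
    return [1.5 * twcv_max - c if c != math.inf else legality_ratio(a, K) * f_min
            for a, c in zip(population, costs)]

def select(population, fitness, rng=random):
    """Proportional selection: Z independent draws with p_z = F(S_z) / sum of F."""
    return rng.choices(population, weights=fitness, k=len(population))

# Toy data (hypothetical): four 2-D patterns, K = 2, a population of three strings.
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
population = [[1, 1, 2, 2], [1, 2, 1, 2], [1, 1, 1, 1]]
twcv_max = 100.0   # largest finite TWCV seen so far in this toy run
print(fitness_values(X, population, 2, twcv_max))   # [146.0, 50.0, 25.0]
```

In this toy run the best partition has TWCV 4 and the highest fitness, while the illegal string `[1, 1, 1, 1]` has legality ratio 0.5 and therefore receives half of the smallest legal fitness.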

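
The mutation operator of Section 3.2 and the K-means operator (KMO) of Section 3.3 differ mainly in how they treat empty clusters: distance 0 for mutation, distance +infinity for KMO. The following Python sketch shows that difference on a toy example. All function names and the data are hypothetical, and the code is a sketch under those assumptions rather than the paper's implementation.

```python
import math
import random

def centroids(X, a, K):
    """Centroid of each cluster 1..K, or None for an empty cluster."""
    D = len(X[0])
    sums = [[0.0] * D for _ in range(K)]
    sizes = [0] * K
    for x, k in zip(X, a):
        sizes[k - 1] += 1
        for d in range(D):
            sums[k - 1][d] += x[d]
    return [[s / n for s in row] if n else None for row, n in zip(sums, sizes)]

def distances(x, cents, empty_value):
    """Euclidean distance from x to every centroid; empty clusters get empty_value."""
    return [math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))
            if c is not None else empty_value for c in cents]

def mutation_probabilities(x, cents):
    """p_k of Section 3.2; an empty cluster has distance 0, hence the largest weight."""
    d = distances(x, cents, empty_value=0.0)
    d_max = max(d)
    weights = [1.5 * d_max - dk + 0.5 for dk in d]
    total = sum(weights)
    return [w / total for w in weights]

def mutate(X, a, K, rng=random):
    """Replace every a_n simultaneously by a cluster number drawn from p_1..p_K."""
    cents = centroids(X, a, K)   # computed once, before any label changes
    labels = list(range(1, K + 1))
    return [rng.choices(labels, weights=mutation_probabilities(x, cents))[0] for x in X]

def kmeans_operator(X, a, K):
    """KMO: assign each pattern to its closest centroid; empty clusters are never chosen."""
    cents = centroids(X, a, K)
    result = []
    for x in X:
        d = distances(x, cents, empty_value=math.inf)
        result.append(d.index(min(d)) + 1)
    return result

# Toy data (hypothetical): the partition [1, 2, 1, 2] is a fixed point of KMO.
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
print(kmeans_operator(X, [1, 2, 1, 2], 2))   # [1, 2, 1, 2]
print(kmeans_operator(X, [1, 1, 1, 1], 2))   # [1, 1, 1, 1]
```

On this data KMO alone leaves `[1, 2, 1, 2]` unchanged even though `[1, 1, 2, 2]` has a much smaller TWCV, and it leaves the illegal string illegal. `mutate` gives every cluster a positive probability, and for the illegal string it favours the empty cluster.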