
Genetic algorithms for large-scale clustering problems

(published in The Computer Journal, 40 (9), 547-554, 1997)

Pasi Fränti¹, Juha Kivijärvi², Timo Kaukoranta² and Olli Nevalainen²

¹ Department of Computer Science,
University of Joensuu, PB 111,
FIN-80101 Joensuu, FINLAND
Email: franti@cs.joensuu.fi

² Turku Centre for Computer Science (TUCS),
Department of Computer Science,
University of Turku,
Lemminkäisenkatu 14 A,
FIN-20520 Turku, FINLAND

Abstract
We consider the clustering problem in a case where the distances of the elements are metric and both the number of attributes and the number of clusters are large. In this setting, the genetic algorithm approach gives high quality clusterings at the cost of a long running time. Three new efficient crossover techniques are introduced. The hybridization of the genetic algorithm and the k-means algorithm is discussed.

Indexing terms: clustering problem, genetic algorithms, vector quantization, image compression, color image quantization.

1. Introduction

Clustering is a combinatorial problem where the aim is to partition a given set of data objects into a certain number of clusters [1, 2]. In this paper we concentrate on large-scale data where the number of data objects (N), the number of constructed clusters (M), and the number of attributes (K) are relatively high. Standard clustering algorithms work well for very small data sets but often perform much worse when applied to large-scale clustering problems. On the other hand, it is expected that a method that works well for large-scale data would also work well for problems of smaller scale.

Clustering includes the following three subproblems: (1) the selection of the cost function, (2) the decision of the number of classes used in the clustering, and (3) the choice of the clustering algorithm. We consider only the last subproblem and assume that the number of classes (clusters) is fixed beforehand. In some applications (such as vector quantization [3]) the question is merely about resource allocation, i.e. how many classes can be afforded. The data set itself may not contain clearly separate clusters, but the aim is to partition the data into a given number of clusters so that the cost function is minimized.
Many of the clustering algorithms can be generalized to the case where the number of classes must also be solved. For example, the clustering algorithm can be repeatedly applied to the data using all reasonable numbers of clusters. The clustering best fitting the data is then chosen according to any suitable criterion. The decision is typically made by the researcher of the application area, but analytical methods have also been considered. For example, by minimizing the stochastic complexity [4] one can determine the clustering for which the entropy of the intracluster diversity and the clustering structure is minimal.

Due to the high number of data objects we use a metric distance function instead of a distance matrix to approximate the distances between the objects. The attributes of the objects are assumed to be numerical and of the same scale. The objects can thus be considered as points in a K-dimensional Euclidean space. The aim of the clustering in the present work is to minimize the intracluster diversity (distortion).
Optimization methods are applicable to the clustering problem [5]. A common property of these methods is that they consider several possible solutions and generate a new solution (or a set of solutions) at each step on the basis of the current one.

In a genetic algorithm (GA) [6] we use a model of natural selection in real life. The idea is the following. An initial population of solutions, called individuals, is (randomly) generated. The algorithm creates new generations of the population by genetic operations, such as reproduction, crossover and mutation. The next generation consists of the possible survivors (i.e. the best individuals of the previous generation) and of the new individuals obtained from the previous population by the genetic operations.
Genetic algorithms have previously been considered for the clustering problem in vector quantization by Delport and Koschorreck [7], and by Pan, McInnes and Jack [8]. Vector quantization was applied to DCT transformed images in [7], and to speech coding in [8]. Scheunders [9] studied genetic algorithms for the scalar quantization of gray-scale images, and Murthy and Chowdhury [10] for the general clustering problem. These studies concentrate on special applications [7, 8, 9], or the algorithms have been applied to very small-scale data sets [10] only, and there is no guarantee that the methods work for large-scale problems in different application domains. In addition, the parameters of the proposed methods should be studied in more detail.
In this paper we present a systematic study on genetic algorithms for the clustering problem. In the design of the algorithms, the key questions are:

- Representation of the solution.
- Selection method.
- Crossover method.

The efficiency of the GA is highly dependent on the coding of the individuals. In our case a natural representation of a solution is a pair (partitioning table, cluster centroids). The partitioning table describes for each data object the index of the cluster where it belongs. The cluster centroids are representative objects of the clusters, and their attributes are found by averaging the corresponding attributes among the objects of the particular cluster.
Three methods for selecting individuals for crossover are considered: a probability-based method and two elitist variants. In the first one, a candidate solution is chosen for crossover with a probability that is a function of its distortion. In the latter two variants only the best solutions are accepted while the rest are dropped.
For the crossover phase, we discuss several problem-oriented methods. These include two previously reported techniques (random crossover [7, 10] and centroid distance [8, 9]) and three new ones (pairwise crossover, largest partitions and pairwise nearest neighbor). It turns out that, due to the nature of the data, none of the studied methods is efficient when used alone; the resulting solutions must be improved by applying a few steps of the conventional k-means clustering algorithm [11]. In this hybrid method, new solutions are first created by crossover and then fine-tuned by the k-means algorithm. In fact, all previously reported GA methods [7-10] include the use of k-means in one form or another.
The rest of the paper is organized as follows. In Section 2 we discuss the clustering problem and the applications and data sets of the problem area. Essential features of the GA solution are outlined in Section 3. Results of the experiments are reported in Section 4, where a comparison to other clustering algorithms is also made. Finally, conclusions are drawn in Section 5.

2. Clustering problem and applications

Let us consider the following six data sets: Bridge, Bridge2, Miss America, House, Lates mariae, and SS2; see Fig. 1. Due to our vector quantization and image compression background, the first four data sets originate from this context. We consider these data sets merely as test cases of the clustering problem.

In vector quantization, the aim is to map the input data objects (vectors) into a representative subset of the vectors, called codevectors. This subset is referred to as a codebook, and it can be constructed using any clustering algorithm. In data compression applications, reduction in storage space is achieved by storing the index of the nearest codevector instead of each original data vector. More details on the vector quantization and image compression applications can be found in [3, 12, 13].
Bridge consists of 4×4 spatial pixel blocks sampled from the gray-scale image (8 bits per pixel). Each pixel corresponds to a single attribute having a value in the range [0, 255]. The data set is very sparse and no clear cluster boundaries can be found. Bridge2 has the blocks of Bridge after a BTC-like quantization into two values according to the average pixel value of the block [14]. The attributes of this data set are binary values (0/1), which makes it an important special case for the clustering. According to our experiments, most of the existing methods do not apply very well to this kind of data.

The third data set (Miss America) has been obtained by subtracting two subsequent image frames of the original video image sequence, and then constructing 4×4 spatial pixel blocks from the residuals. Only the first two frames have been used. An application of this kind of data is found in video image compression [15]. The data set is similar to the first set except that the data objects are presumably more clustered due to the motion compensation (subtraction of subsequent frames).

The fourth data set (House) consists of the RGB color tuples from the corresponding color image. This data could be applied for palette generation in color image quantization [16, 17]. The data objects have only three attributes (red, green and blue color values), but there is a high number of samples (65536). The data space consists of a sparse collection of data objects spread over a wide area, but there are also some clearly isolated and more compact clusters.
The fifth data set (Lates mariae) records 215 data samples from pelagic fishes of Lake Tanganyika. The data originates from biological research in which the occurrence of 52 different DNA fragments was tested for each fish sample (using RAPD analysis) and a binary decision was obtained on whether the fragment was present or absent. This data has applications in studies of genetic variation among the species [18]. From the clustering point of view the set is an example of data with binary attributes. Due to the moderate number of samples (215), the data set is an easy case for the clustering, compared to the first four sets.

The sixth data set is the standard clustering test problem SS2 of [2], pp. 103-104. The data set contains 89 postal zones in Bavaria (Germany), and the attributes are the numbers of self-employed people, civil servants, clerks and manual workers in these areas. The dimensions of this set are rather small in comparison to the other sets. However, it is a commonly used data set and serves here as an example of a typical small-scale clustering problem.

The data sets and their properties are summarized in Table 1. In the experiments made here, we fix the number of clusters to 256 for the image data sets, 8 for the DNA data set, and 7 for the SS2 data set.

Figure 1. Sources for the first five data sets: Bridge (256×256), Bridge2 (256×256), Miss America (360×288), House (256×256), and Lates mariae (52×215).

Table 1. Data sets and their statistics.

Data set        Attributes   # Objects   # Clusters
Bridge          16           4096        256
Bridge2         16           4096        256
Miss America    16           6480        256
House           3            65536       256
Lates mariae    52           215         8
SS2             4            89          7

In the clustering problem we are given a set $Q = \{X^{(i)} \mid i = 1, \ldots, N\}$ of $K$-dimensional vectors $X^{(i)} = (X_1^{(i)}, X_2^{(i)}, \ldots, X_K^{(i)})$. A clustering $\omega = \{C_1, C_2, \ldots, C_M\}$ of $Q$ has the properties

a) $C_i \neq \emptyset$, for $i = 1, \ldots, M$,
b) $C_i \cap C_j = \emptyset$, for $i \neq j$, and
c) $Q = \bigcup_{i=1}^{M} C_i$.

Each cluster $C_i$ ($i = 1, \ldots, M$) has a representative element, called the centroid $Z_i = \frac{1}{|C_i|} \sum_{X \in C_i} X$. Here $|C_i|$ stands for the number of objects in $C_i$ and the summation is over the objects $X$ which belong to the cluster $C_i$.

Assuming that the data objects are points of a Euclidean space, the distance between two objects $X^{(i)}$ and $X^{(j)}$ can be defined by

$$d\left(X^{(i)}, X^{(j)}\right) = \sqrt{\sum_{k=1}^{K} \left(X_k^{(i)} - X_k^{(j)}\right)^2} \qquad (1)$$

where $X_k^{(i)}$ and $X_k^{(j)}$ stand for the $k$th attribute of the objects. Let $f(X^{(i)})$ be a mapping which gives the closest centroid in solution $\omega$ for a sample $X^{(i)}$. The distortion of the solution $\omega$ for the objects $X^{(i)}$ is

$$\mathrm{distortion}(\omega) = \frac{1}{NK} \sum_{i=1}^{N} d\left(X^{(i)}, f(X^{(i)})\right)^2 \qquad (2)$$

The problem is to determine a clustering $\omega_0$ for which $\mathrm{distortion}(\omega_0) = \min_{\omega} \mathrm{distortion}(\omega)$.


When we use (1) as the distance measure we assume that the attributes in the data set are numerical and have the same scale. This is the case for our first five data sets, but we note that it does not hold for the sixth set. In that case the attributes are scaled in order to have similar value ranges. Formula (2) measures the distortion of a solution by the mean square distance between the data objects and their cluster centroids. Again, this is only one of the possible distortion measures.
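To make the definitions concrete, the following sketch evaluates formula (2) for a given set of centroids. Here, and in the later sketches, Python with NumPy is assumed; the array-based layout of the data is our illustration, not part of the paper.

```python
import numpy as np

def distortion(data, centroids):
    """Formula (2): mean squared distance per attribute.

    data:      (N, K) array of data objects
    centroids: (M, K) array of cluster representatives
    """
    # Squared Euclidean distance from every object to every centroid.
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Nearest neighbor mapping f(X_i): keep the smallest distance per object.
    nearest = d2.min(axis=1)
    n, k = data.shape
    return nearest.sum() / (n * k)
```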

3. Genetic algorithm

The general structure of the GA is shown in Fig. 2. Each individual of the population stands for a clustering of the data. An individual is initially created by selecting M random data objects as cluster representatives and by mapping all the other data objects to their nearest representative, according to (1). In each iteration, a predefined number (SB) of the best solutions survive to the next generation. The rest of the population is replaced by new solutions generated in the crossover phase. We discuss the different design alternatives of the algorithm in the following subsections.
1. Generate S random solutions for the initial generation.
2. Iterate the following T times
2.1. Select the SB surviving solutions for the next generation.
2.2. Generate new solutions by crossover.
2.3. Generate mutations to the solutions.
3. Output the best solution of the final generation.

Figure 2. A sketch for a genetic algorithm.
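A minimal Python skeleton of this loop is sketched below. The callables and the distortion attribute are placeholder names of our own for the design alternatives discussed in the following subsections.

```python
def genetic_algorithm(data, create_random, select_pairs, cross, mutate,
                      fine_tune, S=45, T=50, S_B=9):
    """Skeleton of the algorithm of Fig. 2 (all helpers are placeholders)."""
    # 1. Generate S random solutions for the initial generation.
    population = [create_random(data) for _ in range(S)]
    for _ in range(T):                           # 2. iterate T times
        population.sort(key=lambda s: s.distortion)
        survivors = population[:S_B]             # 2.1 the S_B best survive
        # 2.2 + 2.3: new solutions by crossover, then mutation and fine-tuning.
        offspring = [fine_tune(mutate(cross(a, b)))
                     for a, b in select_pairs(population, S - S_B)]
        population = survivors + offspring
    # 3. Output the best solution of the final generation.
    return min(population, key=lambda s: s.distortion)
```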

3.1 Representation of a solution

A solution to the clustering problem can be expressed by the pair (partitioning table, cluster centroids). These two depend on each other so that if one of them has been given, the optimal choice of the other one can be uniquely constructed. This is formalized in the following two optimality conditions [3]:

- Nearest neighbor condition: For a given set of cluster centroids, any data object can be optimally classified by assigning it to the cluster whose centroid is closest to the data object with respect to the distance function.
- Centroid condition: For a given partition, the optimal cluster representative, that is, the one minimizing the distortion, is the centroid of the cluster members.

It is therefore sufficient to determine only the partitioning or the cluster centroids to define a solution. This implies two alternative approaches to the clustering problem:

- Centroid-based (CB)
- Partitioning-based (PB)

In the centroid-based variant, the sets of centroids are the individuals of the population, and they are the objects of the genetic operations. Each solution is represented by an M-length array of K-dimensional vectors (see Fig. 3). The elementary unit is therefore a single centroid. This is a natural way to describe the problem in the context of vector quantization. In this context the set of centroids stands for a codebook of the application, and the partitions are of secondary importance. The partitioning table, however, is needed when evaluating the distortion values of the solutions, and it is calculated using the nearest neighbor condition.

In the partitioning-based variant the partitionings are the individuals of the population. Each partitioning is expressed as an array of N integers from the range 1..M defining the cluster membership of each data object. The elementary unit (gene) is a single membership value. The centroids are calculated using the centroid condition. The partitioning-based variant is commonly used in the traditional clustering algorithms because the aim is to cluster the data with no regard to the representatives of the clusters.

Two methods reported in the literature apply the centroid-based approach [8, 9] and the other two methods apply the partitioning-based approach [7, 10]. From the genetic algorithm's point of view, the difference between the two variants lies in the realization of the crossover and mutation phases. The problem of the partitioning-based representation is that the clusters become non-convex (in the sense that objects from different parts of the data space may belong to the same cluster) if a simple random crossover method is applied, as proposed in [7, 10]. The convexity of the solutions can be restored by applying the k-means algorithm (see Section 3.5), but then the resulting cluster centroids tend to move towards the centroid of the data set. This moves the solutions systematically in the same direction, which slows down the search. It is therefore more effective to operate with the cluster centroids than with the partitioning table. Furthermore, all practical experiments have indicated that the PB variant is inferior to the CB variant. We will therefore limit our discussion to the CB variant in the rest of the paper.
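The two optimality conditions also give the conversions between the two representations. A sketch of both directions (the function names are ours):

```python
import numpy as np

def optimal_partition(data, centroids):
    """Nearest neighbor condition: assign each object to its closest centroid."""
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)                     # (N,) array of cluster indices

def optimal_centroids(data, partition, M):
    """Centroid condition: each representative is the mean of its members."""
    centroids = np.zeros((M, data.shape[1]))
    for i in range(M):
        members = data[partition == i]
        if len(members) > 0:                     # leave empty clusters at zero
            centroids[i] = members.mean(axis=0)
    return centroids
```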

Figure 3. Illustration of a solution: the data set of N data objects, the partitioning table (one cluster index per object), and the cluster centroids (M K-dimensional vectors).

3.2 Selection methods

The selection method defines the way a new generation is constructed from the current one. It consists of the following three parts:

- Determining the SB survivors.
- Selecting the crossing set of SC solutions.
- Selecting the pairs for crossover from the crossing set.

We study the following selection methods:

- Roulette wheel selection
- Elitist selection method 1
- Elitist selection method 2

The first method is a probability-based variant. In this variant, only the best solution survives (SB = 1) and the crossing set consists of all the solutions (SC = S). For the crossover, S − 1 random pairs are chosen by the roulette wheel selection. The weighting function for a solution $\omega$ is

$$w(\omega) = \frac{1}{1 + \mathrm{distortion}(\omega)} \qquad (3)$$

and the probability that the $i$th solution is selected for crossover is

$$p(\omega_i) = \frac{w(\omega_i)}{\sum_{j=1}^{S} w(\omega_j)} \qquad (4)$$

where $\omega_j$, $j = 1, \ldots, S$, are the solutions of the current population.
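A minimal sketch of this selection; solution objects carrying a distortion attribute are our assumption, and random.choices implements the weighted draw of (3) and (4):

```python
import random

def roulette_wheel_pairs(population, n_pairs):
    """Draw parent pairs with probability proportional to w = 1/(1 + distortion)."""
    weights = [1.0 / (1.0 + s.distortion) for s in population]
    return [tuple(random.choices(population, weights=weights, k=2))
            for _ in range(n_pairs)]
```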
In the first elitist variant the SB best individuals survive. They also compose the crossover set, i.e. SC = SB. All the solutions in the crossover set are crossed with each other so that the crossover phase produces SC(SC − 1)/2 new solutions. The population size is thus S = SC(SC + 1)/2. Here we use SC = 9, resulting in a population size of S = 45.

In the second elitist variant only the best solution survives (SB = 1). Except for the number of survivors, the algorithm is the same as in the first variant; the SC best solutions are crossed with each other, giving a population size of S = 1 + SC(SC − 1)/2. Here we use SC = 10, which gives a population size of S = 46. Note that we can select the desired population size by dropping out a proper number of solutions.
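In both elitist variants the parent pairs are simply all pairs over the SC best solutions, as the following sketch shows:

```python
from itertools import combinations

def elitist_crossing_set(population, S_C):
    """Cross the S_C best solutions with each other: S_C*(S_C - 1)/2 pairs."""
    best = sorted(population, key=lambda s: s.distortion)[:S_C]
    return list(combinations(best, 2))
```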

3.3 Crossover algorithms

The object of the crossover operation is to create a new (and hopefully better) solution from the two selected parent solutions (denoted here by A and B). In the CB variants the cluster centroids are the elementary units of the individuals. The crossover can thus be considered as the process of selecting M cluster centroids from the two parent solutions. Next we recall two existing crossover methods (random crossover, centroid distance) and introduce three new methods (pairwise crossover, largest partitions, pairwise nearest neighbor).
Random crossover:

Random multipoint crossover is performed by picking M/2 randomly chosen cluster centroids from each of the two parents in turn. Duplicate centroids are rejected and replaced by repeated picks. This is an extremely simple and quite efficient method, because there is (in the unsorted case) no correlation between neighboring genes to be taken advantage of. The method works in a similar way to the random single-point crossover methods of the PB-based variants [7, 10], but it avoids the non-convexity problem of the PB approach.
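A sketch of the random crossover for the centroid-based representation; testing duplicates by comparing full attribute vectors is our interpretation, and the parents are assumed to contain enough distinct centroids:

```python
import random
import numpy as np

def random_crossover(cA, cB):
    """Pick M/2 centroids from each parent in turn, re-picking duplicates."""
    M = len(cA)
    parents = [list(cA), list(cB)]
    picked, seen, turn = [], set(), 0
    while len(picked) < M:
        c = random.choice(parents[turn])
        key = tuple(c)                  # duplicate centroids are rejected...
        if key not in seen:             # ...and replaced by repeated picks
            seen.add(key)
            picked.append(c)
            turn = 1 - turn             # alternate between the two parents
    return np.array(picked)
```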
Centroid distance [8, 9]:

If the clusters are sorted by some criterion, single-point crossover may be advantageous. In [8], the clusters were sorted according to their distances from the centroid of the entire data set. In a sense, the clusters are divided into two subsets. The first subset (central clusters) consists of the clusters that are close to the centroid of the data set, and the second subset (remote clusters) consists of the clusters that are far from the data set centroid. A new solution is created by taking the central clusters from solution A and the remote clusters from solution B. Note that only the cluster centroids are taken; the data objects are partitioned using the nearest neighbor condition. The change-over point can be anything between 1 and M; we use the half-point (M/2) in our implementation. A simplified version of the same idea was considered in [9] for the scalar quantization of images (K = 1).
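A sketch of the centroid distance crossover with the half-point change-over:

```python
import numpy as np

def centroid_distance_crossover(cA, cB, data):
    """Central half of the clusters from parent A, remote half from parent B."""
    mean = data.mean(axis=0)                 # centroid of the entire data set
    A_sorted = cA[np.argsort(np.linalg.norm(cA - mean, axis=1))]
    B_sorted = cB[np.argsort(np.linalg.norm(cB - mean, axis=1))]
    half = len(cA) // 2                      # change-over point M/2
    return np.vstack([A_sorted[:half], B_sorted[half:]])
```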
Pairwise crossover:

It is desirable that the new individual should inherit different genes from the two parents. The sorting of the clusters by the centroid distance is an attempt of this kind, but the idea can be developed even further. The clusters of the two solutions can be paired by searching for the nearest cluster (in solution B) for every cluster in solution A. Crossover is then performed by taking one cluster centroid (by random choice) from each pair of clusters. In this way we try to avoid selecting similar cluster centroids from both parent solutions. The pairing is done in a greedy manner by taking for each cluster in A the nearest available cluster in B. A cluster that has been paired cannot be chosen again; thus the last cluster in A is paired with the only one left in B. This algorithm does not give the optimal pairing (2-assignment), but it is a reasonably good heuristic for the crossover purpose.
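A sketch of the greedy pairing and the random pick from each pair:

```python
import numpy as np

def pairwise_crossover(cA, cB):
    """Pair each centroid of A with its nearest unused centroid of B,
    then take one centroid per pair by random choice."""
    rng = np.random.default_rng()
    available = list(range(len(cB)))         # B-centroids not yet paired
    child = np.empty_like(cA)
    for i in range(len(cA)):
        # Nearest still-available centroid of B for centroid i of A.
        d = np.linalg.norm(cB[available] - cA[i], axis=1)
        j = available.pop(int(d.argmin()))
        child[i] = cA[i] if rng.random() < 0.5 else cB[j]
    return child
```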
Largest partitions:

In the largest partitions algorithm the M cluster centroids are picked by a greedy heuristic based on the assumption that the larger clusters are more important than the smaller ones. This is a reasonable heuristic rule since our aim is to minimize the intracluster diversity. The cluster centroids should thus be assigned to large concentrations of data objects.

Each cluster in the solutions A and B is assigned a number, the cluster size, indicating how many data objects belong to it. In each phase, we pick the centroid of the largest cluster. Assume that cluster i was chosen from A. The cluster centroid Ci is removed from A to avoid its reselection. For the same reason we update the cluster sizes of B by removing the effect of those data objects in B that were assigned to the chosen cluster i in A.
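One plausible reading of the procedure is sketched below: the globally largest remaining cluster is picked at each phase, and its data objects are then discounted from the other parent's cluster sizes. The 0-based partition arrays pA and pB are assumed to be available alongside the centroids.

```python
import numpy as np

def largest_partitions_crossover(cA, pA, cB, pB, M):
    """cA, cB: (M, K) centroids; pA, pB: (N,) cluster labels of the objects."""
    sizes = [np.bincount(pA, minlength=M).astype(float),
             np.bincount(pB, minlength=M).astype(float)]
    cents, parts = [cA, cB], [pA, pB]
    alive = np.ones(len(pA), dtype=bool)     # objects not yet accounted for
    picked = []
    while len(picked) < M:
        s = 0 if sizes[0].max() >= sizes[1].max() else 1
        i = int(sizes[s].argmax())           # largest remaining cluster
        picked.append(cents[s][i])
        sizes[s][i] = -1.0                   # remove cluster i from parent s
        # Discount its members from the other parent's cluster sizes.
        members = alive & (parts[s] == i)
        np.subtract.at(sizes[1 - s], parts[1 - s][members], 1.0)
        alive &= ~members
    return np.array(picked)
```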

Pairwise nearest neighbor:

An alternative strategy is to consider the crossover phase as a special case of the clustering problem. In fact, if we combine the cluster centroids of A and B, their union can be treated as a data set of 2M data objects. Now our aim is to generate M clusters from this data set. This can be done by any existing clustering algorithm. Here we consider the use of the pairwise nearest neighbor (PNN) algorithm [19]. It is a variant of the so-called agglomerative nesting algorithm and was originally proposed for vector quantization. The PNN algorithm starts by initializing a clustering of size 2M where each data object is considered as its own cluster. Two clusters are combined at each step of the algorithm. The clusters to be combined are the ones that increase the value of the distortion function least. This step is iterated M times, after which the number of clusters has decreased to M.
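A naive O(M^3) sketch of this crossover. Each of the 2M parent centroids starts as a unit-weight cluster, and the pair to merge is chosen by the standard PNN merge cost of [19], n1*n2/(n1+n2) * ||c1 - c2||^2:

```python
import numpy as np

def pnn_crossover(cA, cB, M):
    """Merge the union of the parents' centroids from 2M clusters down to M."""
    cents = [c.astype(float) for c in np.vstack([cA, cB])]
    sizes = [1] * len(cents)                 # every centroid is its own cluster
    while len(cents) > M:
        best, cost = None, np.inf
        for i in range(len(cents)):          # find the cheapest merge
            for j in range(i + 1, len(cents)):
                d2 = np.sum((cents[i] - cents[j]) ** 2)
                c = sizes[i] * sizes[j] / (sizes[i] + sizes[j]) * d2
                if c < cost:
                    best, cost = (i, j), c
        i, j = best
        n = sizes[i] + sizes[j]              # merge j into i (weighted mean)
        cents[i] = (sizes[i] * cents[i] + sizes[j] * cents[j]) / n
        sizes[i] = n
        del cents[j], sizes[j]
    return np.array(cents)
```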

3.4 Mutations

Each cluster centroid is replaced by a randomly chosen data object with probability p. This operation is performed before the partitioning phase. We fix this probability to p = 0.01, which has given good results in our experiments.
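A sketch of the mutation operator:

```python
import numpy as np

def mutate(centroids, data, p=0.01):
    """Replace each centroid by a randomly chosen data object with probability p."""
    rng = np.random.default_rng()
    out = centroids.copy()
    for i in range(len(out)):
        if rng.random() < p:
            out[i] = data[rng.integers(len(data))]
    return out
```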

3.5 Fine-tuning by the k-means algorithm

One can try to improve the algorithm by applying a few steps of the k-means algorithm to each new solution [3, 11]. The crossover operation first generates a rough estimate of the solution, which is then fine-tuned by the k-means algorithm. This modification allows faster convergence of the solution than the pure genetic algorithm.

Our implementation of the k-means algorithm is the following. The initial solution is iteratively modified by applying the two optimality conditions (of Section 3.1) in turn. In the first stage the centroids are fixed and the clusters are recalculated using the nearest neighbor condition. In the second stage the clusters are fixed and new centroids are calculated. The optimality conditions guarantee that the new solution is always at least as good as the original one.
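A sketch of the fine-tuning stage (two steps in the present work):

```python
import numpy as np

def kmeans_steps(data, centroids, steps=2):
    """A few iterations of the two optimality conditions of Section 3.1."""
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(steps):
        # Stage 1: centroids fixed, repartition by the nearest neighbor condition.
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        partition = d2.argmin(axis=1)
        # Stage 2: partition fixed, recompute centroids by the centroid condition.
        new_c = centroids.copy()
        for i in range(len(centroids)):
            members = data[partition == i]
            if len(members) > 0:
                new_c[i] = members.mean(axis=0)
        centroids = new_c
    return centroids, partition
```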

4. Test results

The performance of the genetic algorithm is illustrated in Fig. 4 as a function of the number of generations for Bridge and Bridge2. The inclusion of the k-means algorithm is essential; even the worst candidate in each generation is better than any of the candidates without k-means. The drawback of the hybridization is that the running time grows considerably as the number of k-means steps increases. Fortunately, it is not necessary to run the k-means algorithm to its convergence; a couple of steps (two in the present work) suffice. The results are similar for the other data sets not shown in Fig. 4.

The performance of the different crossover methods is illustrated in Fig. 5 as a function of the number of generations. The pairwise crossover and the PNN method outperform the centroid distance and random crossover methods. Of the tested methods the PNN algorithm is the best choice. It gives the best clustering with the fewest number of iterations. Only for binary data does the pairwise crossover method obtain slightly better results in the long run.
The performance of the different selection and crossover methods is summarized in Table 2. The selection method seems to have a smaller effect on the overall performance. In most cases the elitist variants are better than the roulette wheel selection. However, for the best crossover method (the PNN algorithm) the roulette wheel selection is a slightly better choice.

The above observations demonstrate two important properties of genetic algorithms for large-scale clustering problems. A successful implementation of GA should direct the search efficiently, but it should also retain enough genetic variation in the population. The first property is clearly more important because all ideas based on it (inclusion of k-means, PNN crossover, elitist selection) gave good results. Their combination, however, reduces the genetic variation so that the algorithm converges too quickly. Thus, the best results were reached only for the binary data sets. In the best variant, we therefore use the roulette wheel selection to compensate for the loss of genetic variation.

Among the other parameters, the amount of mutations had only a small effect on the performance. Another interesting but less important question is whether extra computing resources should be used to increase the generation size or the number of iteration rounds. Additional tests have shown that the number of iteration rounds has a slight edge over the generation size, but the difference is small and the quality of the best clustering depends mainly on the total number of candidate solutions tested.
The best variant of GA is next compared to other existing clustering algorithms. The simulated annealing algorithm (SA) is implemented here as proposed in [20]. The method is basically the same as the k-means algorithm, but random noise is added to the cluster centroids after each step. A logarithmic temperature schedule decreases the temperature by 1% after each iteration step.
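A sketch of this SA variant, reusing kmeans_steps from Section 3.5. The initial temperature and the Gaussian noise model are our assumptions; [20] specifies the actual formulation.

```python
import numpy as np

def simulated_annealing(data, centroids, T0=1.0, iterations=500):
    """k-means steps with random noise added to the centroids after each step."""
    rng = np.random.default_rng()
    temp = T0
    for _ in range(iterations):
        centroids, _ = kmeans_steps(data, centroids, steps=1)
        centroids = centroids + rng.normal(0.0, temp, size=centroids.shape)
        temp *= 0.99                         # 1% decrease after each iteration
    return kmeans_steps(data, centroids, steps=1)
```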
The results for k-means, PNN, SA and GA are summarized in Table 3. We observe that GA clearly outperforms the other algorithms in the comparison. SA can match the GA results only for the two smallest test sets, and only if the method is repeated several times, as shown in Table 4. The statistics also show that GA is relatively independent of the initialization, whereas the results of k-means have much higher variation. According to the Student's t-test (independent samples with no assumptions on the equality of the variances), the differences between the GA results and the k-means/SA results are significant (with risk of wrong decision p < 0.05), except for the SA and GA results for Bridge.

It was proposed in [10] that the initial population be constructed as the output of S independent runs of the k-means algorithm. However, this approach had no benefits in our tests compared to the present approach, where k-means is applied in each generation. Only a moderate improvement was achieved when k-means was applied to the initial population only. Furthermore, if k-means iterations are already integrated into each generation, random initialization can be used as well. For a more detailed discussion of various hybridizations of GA and k-means, see [21].

The drawback of GA is its high running time. For Bridge the running times (min:sec) of the algorithms (k-means, SA, PNN, GA) were 0:38, 13:03, 67:00 and 880:00, respectively. Higher quality clusterings are thus obtained at the cost of a longer running time.

Figure 4. Quality of the best (solid lines) and worst candidate solutions (broken lines), with and without the k-means steps, as a function of the generation number for Bridge (left) and for Bridge2 (right). The elitist selection method 1 was applied with the random crossover technique. Two steps of the k-means algorithm were applied.

Figure 5. Convergence (distortion as a function of the generation number) of the various crossover algorithms (random, centroid distance, pairwise, largest partitions, PNN) for Bridge (left) and for Bridge2 (right). The elitist selection method 1 was used, and two steps of the k-means algorithm were applied.


Table 2. Performance comparison of the selection and crossover techniques. GA results are averaged over five test runs. The distortion values for Bridge are due to (2). For Bridge2 the table shows the average number of distorted attributes per data object (varying from 0 to K). The population size is 45 for elitist method 1, and 46 for method 2.

Bridge              Random     Centroid   Pairwise   Largest      PNN
                    crossover  distance   crossover  partitions   algorithm
Roulette wheel      174.77     172.09     168.36     178.45       162.09
Elitist method 1    173.46     168.73     164.34     172.44       162.91
Elitist method 2    173.38     168.21     164.28     171.93       162.90

Bridge2             Random     Centroid   Pairwise   Largest      PNN
                    crossover  distance   crossover  partitions   algorithm
Roulette wheel      1.40       1.34       1.34       1.30         1.28
Elitist method 1    1.34       1.30       1.27       1.30         1.28
Elitist method 2    1.35       1.30       1.26       1.29         1.27

Table 3. Performance comparison of various algorithms. In GA, the roulette wheel selection method and the PNN crossover with two steps of the k-means algorithm were applied. The k-means and SA results are averages over 100 test runs; the GA results over 5 test runs.

Data set        k-means   PNN      SA       GA
Bridge          179.68    169.15   162.45   162.09
Miss America    5.96      5.52     5.26     5.18
House           7.81      6.36     6.03     5.92
Bridge2         1.48      1.33     1.52     1.28
Lates mariae    5.28      5.41     5.19     4.56
SS2             1.14      0.34     0.32     0.31

Table 4. Statistics (min-max, with standard deviation in parentheses) of the test runs; see Table 3 for the parameter settings.

Data set        k-means                  SA                       GA
Bridge          176.85-183.93 (1.442)    162.08-163.29 (0.275)    161.75-162.39 (0.305)
Miss America    5.80-6.11 (0.056)        5.24-5.29 (0.013)        5.17-5.18 (0.005)
House           7.38-8.32 (0.196)        5.97-6.08 (0.023)        5.91-5.93 (0.009)
Bridge2         1.45-1.52 (0.015)        1.43-1.50 (0.014)        1.28-1.29 (0.002)
Lates mariae    4.56-6.67 (0.457)        4.56-5.28 (0.166)        4.56-4.56 (0.000)
SS2             0.40-2.29 (0.899)        0.31-0.35 (0.009)        0.31-0.31 (0.000)


5. Conclusions

GA solutions for large-scale clustering problems were studied. The implementation of a GA-based clustering algorithm is quite simple and straightforward. However, problem-specific modifications were needed because of the nature of the data. New candidate solutions are created in the crossover, but they are too arbitrary to give a reasonable solution to the problem. Thus, the candidate solutions must be fine-tuned by a small number of steps of the k-means algorithm.

The main parameters of GA for the clustering problems studied here are the inclusion of the k-means steps and the crossover technique. The mutation probability and the choice of the selection method seem to be of minor importance. The centroid-based representation of the solution was applied. The results were promising for this configuration of GA. The results of GA (when measured by the intracluster diversity) were better than those of k-means and PNN. For non-binary data sets SA gave results competitive with GA with less computing effort, but for binary data sets GA is still superior. The major drawback of GA is the high running time, which may in some cases prohibit the use of the algorithm.

Acknowledgements

The work of Pasi Fränti was supported by a grant from the Academy of Finland.


References

[1] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York, 1990.
[2] H. Späth, Cluster Analysis Algorithms for Data Reduction and Classification of Objects. Ellis Horwood Limited, West Sussex, UK, 1980.
[3] A. Gersho and R.M. Gray, Vector Quantization and Signal Compression. Kluwer Academic Publishers, Boston, 1992.
[4] M. Gyllenberg, T. Koski and M. Verlaan, Classification of binary vectors by stochastic complexity. Journal of Multivariate Analysis, 63 (1), 47-72, October 1997.
[5] K.S. Al-Sultan and M.M. Khan, Computational experience on four algorithms for the hard clustering problem. Pattern Recognition Letters, 17, 295-308, 1996.
[6] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, 1989.
[7] V. Delport and M. Koschorreck, Genetic algorithm for codebook design in vector quantization. Electronics Letters, 31, 84-85, January 1995.
[8] J.S. Pan, F.R. McInnes and M.A. Jack, VQ codebook design using genetic algorithms. Electronics Letters, 31, 1418-1419, August 1995.
[9] P. Scheunders, A genetic Lloyd-Max quantization algorithm. Pattern Recognition Letters, 17, 547-556, 1996.
[10] C.A. Murthy and N. Chowdhury, In search of optimal clusters using genetic algorithms. Pattern Recognition Letters, 17, 825-832, 1996.
[11] J.B. McQueen, Some methods of classification and analysis of multivariate observations. Proc. 5th Berkeley Symp. Mathemat. Statist. Probability 1, 281-296. Univ. of California, Berkeley, USA, 1967.
[12] N.M. Nasrabadi and R.A. King, Image coding using vector quantization: a review. IEEE Transactions on Communications, 36, 957-971, 1988.
[13] C.F. Barnes, S.A. Rizvi and N.M. Nasrabadi, Advances in residual vector quantization: a review. IEEE Transactions on Image Processing, 5, 226-262, February 1996.
[14] P. Fränti, T. Kaukoranta and O. Nevalainen, On the design of a hierarchical BTC-VQ compression system. Signal Processing: Image Communication, 8, 551-562, 1996.
[15] J.E. Fowler Jr., M.R. Carbonara and S.C. Ahalt, Image coding using differential vector quantization. IEEE Transactions on Circuits and Systems for Video Technology, 3 (5), 350-367, October 1993.
[16] M.T. Orchard and C.A. Bouman, Color quantization of images. IEEE Transactions on Signal Processing, 39, 2677-2690, December 1991.
[17] X. Wu, YIQ vector quantization in a new color palette architecture. IEEE Transactions on Image Processing, 5, 321-329, February 1996.
[18] L. Kuusipalo, Diversification of endemic Nile perch Lates mariae (Centropomidae, Pisces) populations in Lake Tanganyika, East Africa, studied with RAPD-PCR. Proc. Symposium on Lake Tanganyika Research, 60-61, Kuopio, Finland, 1995.
[19] W.H. Equitz, A new vector quantization clustering algorithm. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37, 1568-1575, October 1989.
[20] K. Zeger and A. Gersho, Stochastic relaxation algorithm for improved vector quantiser design. Electronics Letters, 25 (14), 896-898, July 1989.
[21] P. Fränti, J. Kivijärvi, T. Kaukoranta and O. Nevalainen, Genetic algorithms for codebook generation in vector quantization. Proc. 3rd Nordic Workshop on Genetic Algorithms, 207-222, Helsinki, Finland, August 1997.
