
CS345a: Data Mining

Jure Leskovec and Anand Rajaraman

Stanford University
Clustering Algorithms
• Given a set of data points, group them into clusters so that:
  • points within each cluster are similar to each other
  • points from different clusters are dissimilar
• Usually, points are in a high-dimensional space, and similarity is defined using a distance measure
  • Euclidean, cosine, Jaccard, edit distance, ...
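The distance measures named above could be sketched in Python as follows. This is an illustration, not part of the original slides: the function names are hypothetical, the cosine distance is taken to be the angle between the two vectors, and the edit distance shown is the Levenshtein variant (insertions, deletions and substitutions), which is only one of several definitions in use.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Angle (in radians) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos)


def jaccard_distance(s, t):
    """1 minus (size of intersection / size of union) of two sets."""
    s, t = set(s), set(t)
    if not s and not t:
        return 0.0
    return 1.0 - len(s & t) / len(s | t)


def edit_distance(s, t):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # delete from s
                           cur[j - 1] + 1,             # insert into s
                           prev[j - 1] + (cs != ct)))  # substitute
        prev = cur
    return prev[-1]
```

Any of these can be passed as the `dist` argument wherever a clustering routine only needs pairwise distances.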
[Figure: dogs plotted by height and weight, forming three clusters: Chihuahuas, Dachshunds, Beagles.]
• A catalog of 2 billion "sky objects" represents objects by their radiation in 7 dimensions (frequency bands).
• Problem: cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
• Sloan Sky Survey is a newer, better version.
• Cluster customers based on their purchase histories
• Cluster products based on the sets of customers who purchased them
• Cluster documents based on similar words or shingles
• Cluster DNA sequences based on edit distance
• Hierarchical (agglomerative):
  • Initially, each point is in a cluster by itself.
  • Repeatedly combine the two "nearest" clusters into one.
• Point assignment:
  • Maintain a set of clusters.
  • Place points into their "nearest" cluster.
• Key operation: repeatedly combine the two nearest clusters
• Three important questions:
  • How do you represent a cluster of more than one point?
  • How do you determine the "nearness" of clusters?
  • When to stop combining clusters?
• In a Euclidean space, each cluster has a well-defined centroid
  • I.e., the average across all the points in the cluster
• Represent each cluster by its centroid
• Distance between clusters = distance between centroids
[Figure: data points (o) at (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); centroids (x) of the successively merged clusters at (1.5,1.5), (4.5,0.5), (1,1), (4.7,1.3).]
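The six example points can be run through a small centroid-based agglomerative routine to see where those centroids come from. This is a minimal Python sketch under stated assumptions: the names `centroid` and `agglomerate` are hypothetical, merging stops at a requested number of clusters, and each round recomputes every pairwise centroid distance (the naive approach).

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise average of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def agglomerate(points, k):
    """Merge the two clusters with the nearest centroids until k remain.

    Returns the final clusters and the centroid created by each merge.
    """
    clusters = [[p] for p in points]
    merge_centroids = []
    while len(clusters) > k:
        # Find the pair of clusters whose centroids are closest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                     centroid(clusters[ij[1]])),
        )
        merged = clusters[i] + clusters[j]
        merge_centroids.append(centroid(merged))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merge_centroids


points = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
clusters, merge_centroids = agglomerate(points, 2)
```

With these points the merges produce centroids (1.5, 1.5), (4.5, 0.5), (1, 1) and approximately (4.7, 1.3), in that order.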
• In a non-Euclidean space, the only "locations" we can talk about are the points themselves.
  • I.e., there is no "average" of two points.
• Approach 1: clustroid = point "closest" to other points.
  • Treat the clustroid as if it were the centroid when computing intercluster distances.
Possible meanings of "closest":
1. Smallest maximum distance to the other points.
2. Smallest average distance to other points.
3. Smallest sum of squares of distances to other points.
4. Etc., etc.
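The first three meanings translate directly into a scoring function over the points of a cluster. The sketch below is illustrative only: `clustroid` and its `criterion` argument are hypothetical names, and a Levenshtein edit distance on strings stands in for whatever non-Euclidean distance is actually in use.

```python
def edit_distance(s, t):
    """Levenshtein distance (an assumed choice of non-Euclidean distance)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def clustroid(points, dist, criterion="sum_sq"):
    """Return the point of the cluster that minimises the chosen criterion."""
    if len(points) == 1:
        return points[0]

    def score(i):
        ds = [dist(points[i], q) for j, q in enumerate(points) if j != i]
        if criterion == "max":
            return max(ds)                 # smallest maximum distance
        if criterion == "avg":
            return sum(ds) / len(ds)       # smallest average distance
        if criterion == "sum_sq":
            return sum(d * d for d in ds)  # smallest sum of squares
        raise ValueError(f"unknown criterion: {criterion}")

    return points[min(range(len(points)), key=score)]


words = ["abcd", "abce", "abde", "xyde"]
best = clustroid(words, edit_distance, criterion="sum_sq")
```

The criteria need not agree on the same point, so the choice is part of the cluster representation.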
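[Figure: six points, numbered 1 to 6, in two clusters; each cluster's clustroid is marked, and the intercluster distance is the distance between the two clustroids.]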
• Approach 2: intercluster distance = minimum of the distances between any two points, one from each cluster.
• Approach 3: pick a notion of "cohesion" of clusters, e.g., maximum distance from the clustroid.
  • Merge clusters whose union is most cohesive.
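Approach 2 is a one-liner once a point-to-point distance is available. A minimal Python sketch, assuming clusters are plain lists of points and using the hypothetical name `min_pair_distance`:

```python
import math


def min_pair_distance(cluster_a, cluster_b, dist=math.dist):
    """Smallest distance between a point of cluster_a and a point of cluster_b."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)


gap = min_pair_distance([(0, 0), (1, 0)], [(3, 0), (10, 0)])
```

The `dist` parameter defaults to Euclidean distance here, but any distance function on the points can be supplied.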
Ways to measure cohesion:
• Approach 1: use the diameter of the merged cluster = maximum distance between points in the cluster.
• Approach 2: use the average distance between points in the cluster.
• Approach 3: use a density-based approach: take the diameter or average distance, e.g., and divide by the number of points in the cluster.
  • Perhaps raise the number of points to a power first, e.g., square root.
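The three cohesion measures can be written down in a few lines. This Python sketch uses hypothetical function names and Euclidean distance by default; the `power` parameter models the optional exponent on the number of points.

```python
import math
from itertools import combinations


def diameter(cluster, dist=math.dist):
    """Maximum distance between any two points of the cluster."""
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def average_distance(cluster, dist=math.dist):
    """Average distance over all pairs of points in the cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def density_score(cluster, power=1.0, dist=math.dist):
    """Diameter divided by (number of points raised to `power`)."""
    return diameter(cluster, dist) / (len(cluster) ** power)


triangle = [(0, 0), (3, 4), (0, 4)]
```

For all three measures a smaller value means a more cohesive cluster, so a merge step would pick the pair whose union scores lowest.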
• Stop when we have k clusters
• Stop when the cohesion of the cluster resulting from the best merger falls below a threshold
• Stop when there is a sudden jump in the cohesion value
• Naive implementation:
  • At each step, compute pairwise distances between each pair of clusters
  • $O(n^3)$
• Careful implementation using a priority queue can reduce time to $O(n^2 \log n)$
• Too expensive for really big data sets that don't fit in memory
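One way to use a priority queue is to keep every candidate pair in a heap and skip entries that refer to clusters that have already been merged away. The Python sketch below illustrates that idea with the standard `heapq` module; the function names are hypothetical, clusters are compared by centroid distance, and centroids are recomputed from the member points for simplicity, so it shows the structure of the approach rather than a tuned implementation of the stated bound.

```python
import heapq
import math
from itertools import combinations


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def agglomerate_with_heap(points, k):
    """Centroid-based agglomerative clustering driven by a heap of pairs."""
    clusters = {i: [p] for i, p in enumerate(points)}  # id -> member points
    heap = [(math.dist(p, q), i, j)
            for (i, p), (j, q) in combinations(enumerate(points), 2)]
    heapq.heapify(heap)
    next_id = len(points)
    while len(clusters) > k:
        _, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue  # stale entry: one side was merged earlier
        merged = clusters.pop(i) + clusters.pop(j)
        c = centroid(merged)
        # Queue the distance from the new cluster to every surviving one.
        for other_id, members in clusters.items():
            heapq.heappush(heap, (math.dist(c, centroid(members)), other_id, next_id))
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values())


result = agglomerate_with_heap([(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)], 2)
```

Stale entries are discarded lazily when popped, which avoids having to delete from the middle of the heap.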
• Assumes Euclidean space.
• Start by picking k, the number of clusters.
• Initialize clusters by picking one point per cluster.
  • Example: pick one point at random, then k - 1 other points, each as far away as possible from the previous points.
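The example initialization can be sketched as below. "As far away as possible from the previous points" is interpreted here as maximizing the minimum distance to the points already chosen, which is an assumption; the function name and the `seed` parameter are hypothetical.

```python
import math
import random


def pick_initial_points(points, k, seed=0):
    """Pick one random point, then k - 1 more, each maximising its
    minimum distance to the points chosen so far."""
    rng = random.Random(seed)
    chosen = [rng.choice(points)]
    while len(chosen) < k:
        candidates = [p for p in points if p not in chosen]
        chosen.append(max(candidates,
                          key=lambda p: min(math.dist(p, c) for c in chosen)))
    return chosen


data = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 10), (5, 11)]
initial = pick_initial_points(data, 3)
```

On data with well-separated groups, such as `data` above, this tends to place one starting point in each group regardless of which point is drawn first.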
1. For each point, place it in the cluster whose current centroid it is nearest, and update the centroid of the cluster.
2. After all points are assigned, fix the centroids of the k clusters.
3. Optional: reassign all points to their closest centroid.
  • Sometimes moves points between clusters.
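The three steps can be sketched as a single round in Python. This is an illustration under stated assumptions: `kmeans_round` is a hypothetical name, the initial points are taken from the data set and are assumed distinct from all other points, and running sums are kept so that a centroid can be updated as each point arrives.

```python
import math


def kmeans_round(points, seeds):
    """One round: incremental assignment, fixed centroids, then reassignment."""
    sums = [list(s) for s in seeds]  # per-cluster coordinate sums
    counts = [1] * len(seeds)        # each seed starts its own cluster

    # Step 1: place each point with the nearest current centroid and update it.
    for p in points:
        if p in seeds:
            continue  # seeds are already counted
        current = [[v / n for v in s] for s, n in zip(sums, counts)]
        j = min(range(len(current)), key=lambda c: math.dist(p, current[c]))
        counts[j] += 1
        sums[j] = [a + b for a, b in zip(sums[j], p)]

    # Step 2: fix the centroids.
    centroids = [tuple(v / n for v in s) for s, n in zip(sums, counts)]

    # Step 3 (optional): reassign every point to its closest fixed centroid.
    labels = [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
              for p in points]
    return centroids, labels


pts = [(0, 0), (0, 2), (2, 0), (10, 10), (10, 12), (12, 10)]
centroids, labels = kmeans_round(pts, seeds=[(0, 0), (10, 10)])
```

Because centroids drift during step 1, a point assigned early may end up closer to a different centroid, which is why step 3 can move points between clusters.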
[Figure: eight points, numbered 1 to 8, and two centroids (x), showing the clusters after the first round and the reassigned points.]
• Try different k, looking at the change in the average distance to centroid as k increases.
• Average falls rapidly until the right k, then changes little.
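The effect is easy to reproduce on a toy data set. The sketch below is an assumption-laden illustration: it uses a plain batch k-means with a deterministic farthest-first start (not necessarily the variant described on these slides), and the data are three hypothetical, well-separated squares of four points each.

```python
import math


def kmeans(points, k, iters=20):
    """Plain batch k-means with a deterministic farthest-first start."""
    cents = [points[0]]
    while len(cents) < k:
        cents.append(max(points, key=lambda p: min(math.dist(p, c) for c in cents)))
    for _ in range(iters):
        groups = [[] for _ in cents]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, cents[c]))
            groups[j].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        cents = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
                 for g, c in zip(groups, cents)]
    return cents


def avg_distance_to_centroid(points, cents):
    return sum(min(math.dist(p, c) for c in cents) for p in points) / len(points)


data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 0), (10, 1), (11, 0), (11, 1),
        (5, 10), (5, 11), (6, 10), (6, 11)]
avg = {k: avg_distance_to_centroid(data, kmeans(data, k)) for k in range(1, 5)}
```

On this data the average drops steeply from k = 1 to k = 3 and barely changes from k = 3 to k = 4, which is the pattern used to pick k.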
[Figure: average distance to centroid plotted against k, with the best value of k marked.]
[Figure: scatter of points with too few clusters. Too few; many long distances to centroid.]
[Figure: the same scatter with the right number of clusters. Just right; distances rather short.]
[Figure: the same scatter with too many clusters. Too many; little improvement in average distance.]
• BFR (Bradley-Fayyad-Reina) is a variant of k-means designed to handle very large (disk-resident) data sets.
• It assumes that clusters are normally distributed around a centroid in a Euclidean space.
  • Standard deviations in different dimensions may vary.
• Points are read one main-memory-full at a time.
• Most points from previous memory loads are summarized by simple statistics.
• To begin, from the initial load we select the initial k centroids by some sensible approach.
• Possibilities include:
  1. Take a small random sample and cluster optimally.
  2. Take a sample, pick a random point, and then k - 1 more points, each as far from the previously selected points as possible.
Three classes of points:
1. The discard set (DS): points close enough to a centroid to be summarized.
2. The compression set (CS): groups of points that are close together but not close to any centroid. They are summarized, but not assigned to a cluster.
3. The retained set (RS): isolated points.
[Figure: a cluster around its centroid, whose points are in the DS; compressed sets, whose points are in the CS; and isolated points in the RS.]
• For each cluster, the discard set is summarized by:
  1. The number of points, N.
  2. The vector SUM: i-th component = sum of the coordinates of the points in the i-th dimension.
  3. The vector SUMSQ: i-th component = sum of squares of coordinates in the i-th dimension.
• 2d + 1 values represent any number of points.
  • d = number of dimensions.
• Centroid (mean) in the i-th dimension = $\mathrm{SUM}_i / N$.
  • $\mathrm{SUM}_i$ = i-th component of SUM.
• Variance in dimension i can be computed by: $(\mathrm{SUMSQ}_i / N) - (\mathrm{SUM}_i / N)^2$
• Question: Why use this representation rather than directly store centroid and standard deviation?
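A small class makes the bookkeeping concrete and hints at one likely answer to the question: N, SUM and SUMSQ are plain sums, so adding a point or combining two summaries is just component-wise addition. The class and method names below are hypothetical, and this is a sketch rather than the BFR authors' code.

```python
class ClusterSummary:
    """N, SUM and SUMSQ for one set of points (2d + 1 numbers)."""

    def __init__(self, d):
        self.n = 0
        self.sum = [0.0] * d
        self.sumsq = [0.0] * d

    def add(self, point):
        """Account for one more point."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def merge(self, other):
        """Absorb another summary by adding N's, SUM's and SUMSQ's."""
        self.n += other.n
        self.sum = [a + b for a, b in zip(self.sum, other.sum)]
        self.sumsq = [a + b for a, b in zip(self.sumsq, other.sumsq)]

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # (SUMSQ_i / N) - (SUM_i / N)^2 in each dimension
        return [q / self.n - (s / self.n) ** 2
                for s, q in zip(self.sum, self.sumsq)]


a = ClusterSummary(2)
a.add((1, 2))
b = ClusterSummary(2)
b.add((3, 6))
a.merge(b)
```

After the merge, `a` describes the two points (1, 2) and (3, 6) without storing either of them.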
1. Find those points that are "sufficiently close" to a cluster centroid; add those points to that cluster and the DS.
2. Use any main-memory clustering algorithm to cluster the remaining points and the old RS.
  • Clusters go to the CS; outlying points to the RS.
3. Adjust statistics of the clusters to account for the new points.
  • Add N's, SUM's, SUMSQ's.
4. Consider merging compressed sets in the CS.
5. If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster.
• How do we decide if a point is "close enough" to a cluster that we will add the point to that cluster?
• How do we decide whether two compressed sets deserve to be combined into one?
• We need a way to decide whether to put a new point into a cluster.
• BFR suggests two ways:
  1. The Mahalanobis distance is less than a threshold.
  2. Low likelihood of the currently nearest centroid changing.
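• Mahalanobis distance: normalized Euclidean distance from centroid.
• For point $(x_1, \ldots, x_d)$ and centroid $(c_1, \ldots, c_d)$:
  1. Normalize in each dimension: $y_i = (x_i - c_i) / \sigma_i$, where $\sigma_i$ is the standard deviation of the cluster in dimension i.
  2. Take the sum of the squares of the $y_i$'s.
  3. Take the square root.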
• If clusters are normally distributed in d dimensions, then after transformation, one standard deviation = $\sqrt{d}$.
  • I.e., 70% of the points of the cluster will have a Mahalanobis distance < $\sqrt{d}$.
• Accept a point for a cluster if its M.D. is < some threshold, e.g. 4 standard deviations.
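The distance and the acceptance rule can be sketched together. In this Python illustration the function names are hypothetical, the per-dimension standard deviations are assumed to be non-zero, and "4 standard deviations" is read as 4 times the square root of the number of dimensions, following the statement that one standard deviation corresponds to that square root.

```python
import math


def mahalanobis(point, centroid, sigma):
    """Normalise each coordinate by the cluster's standard deviation in that
    dimension, then take the Euclidean length of the result."""
    return math.sqrt(sum(((x - c) / s) ** 2
                         for x, c, s in zip(point, centroid, sigma)))


def accept(point, centroid, sigma, num_std=4.0):
    """Accept the point if its distance is below num_std 'standard deviations'."""
    d = len(point)
    return mahalanobis(point, centroid, sigma) < num_std * math.sqrt(d)


centroid = (1.0, 0.0)
sigma = (2.0, 4.0)  # standard deviation per dimension
near = accept((3.0, 4.0), centroid, sigma)
far = accept((100.0, 0.0), centroid, sigma)
```

Here (3, 4) is one standard deviation away in each dimension, so its distance is the square root of 2 and it is accepted, while (100, 0) is rejected.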
[Figure: contours at $\sigma$ and $2\sigma$ around a centroid.]
Deciding whether to combine two compressed sets:
• Compute the variance of the combined subcluster.
  • N, SUM, and SUMSQ allow us to make that calculation quickly.
• Combine if the variance is below some threshold.
• Many alternatives: treat dimensions differently, consider density.
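Because the summaries add, the variance of a would-be merged subcluster is available without revisiting any points. The sketch below represents each compressed set as a hypothetical `(N, SUM, SUMSQ)` tuple and, as an assumption, requires the variance in every dimension to fall below the threshold; other combinations of the per-dimension variances would be equally plausible.

```python
def combined_variance(a, b):
    """Per-dimension variance of the union of two summarised sets.

    Each argument is a tuple (N, SUM, SUMSQ) with SUM and SUMSQ as sequences.
    """
    n = a[0] + b[0]
    total = [x + y for x, y in zip(a[1], b[1])]
    total_sq = [x + y for x, y in zip(a[2], b[2])]
    return [q / n - (s / n) ** 2 for s, q in zip(total, total_sq)]


def should_merge(a, b, threshold):
    """Merge if the combined variance is below the threshold in every dimension."""
    return all(v < threshold for v in combined_variance(a, b))


set_a = (2, (2, 0), (4, 0))   # summarises the points (0, 0) and (2, 0)
set_b = (2, (2, 4), (2, 10))  # summarises the points (1, 1) and (1, 3)
```

For these two sets the combined variance is 0.5 in the first dimension and 1.5 in the second.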
• Problem with BFR/k-means:
  • Assumes clusters are normally distributed in each dimension.
  • And the axes are fixed: ellipses at an angle are not OK.
• CURE:
  • Assumes a Euclidean distance.
  • Allows clusters to assume any shape.
[Figure: scatter plot of points labeled e and h, with axes age and salary.]
1. Pick a random sample of points that fit in main memory.
2. Cluster these points hierarchically: group nearest points/clusters.
3. For each cluster, pick a sample of points, as dispersed as possible.
4. From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster.
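Steps 3 and 4 can be sketched for a single cluster as follows. The function names are hypothetical, and "as dispersed as possible" is approximated with a greedy farthest-first choice (start with the point farthest from the centroid, then repeatedly take the point farthest from those already picked), which is an assumption rather than the only reasonable reading.

```python
import math


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def dispersed_sample(cluster, m):
    """Greedily pick up to m points that are far from one another."""
    cluster = list(dict.fromkeys(cluster))  # drop duplicate points
    c = centroid(cluster)
    reps = [max(cluster, key=lambda p: math.dist(p, c))]
    while len(reps) < min(m, len(cluster)):
        candidates = [p for p in cluster if p not in reps]
        reps.append(max(candidates,
                        key=lambda p: min(math.dist(p, r) for r in reps)))
    return reps


def shrink(reps, c, fraction=0.2):
    """Move each sampled point the given fraction of the way toward c."""
    return [tuple(x + fraction * (ci - x) for x, ci in zip(p, c)) for p in reps]


cluster = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
sample = dispersed_sample(cluster, 4)
representatives = shrink(sample, centroid(cluster))
```

On this square-shaped cluster the four corners are sampled and each is pulled 20% of the way toward the centre at (2, 2).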
[Figure: the age/salary scatter plot of e and h points. Pick (say) 4 remote points for each cluster.]
[Figure: the age/salary scatter plot of e and h points. Move points (say) 20% toward the centroid.]
• Now, visit each point p in the data set.
• Place it in the "closest cluster."
  • Normal definition of "closest": that cluster with the closest (to p) among all the sample points of all the clusters.
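The assignment rule amounts to a nearest-neighbour lookup over all the sample points. A minimal Python sketch, assuming the sample points are kept in a dictionary keyed by a hypothetical cluster label:

```python
import math


def assign(p, representatives):
    """Return the label of the cluster that owns the sample point closest to p.

    `representatives` maps a cluster label to its list of sample points.
    """
    return min(representatives,
               key=lambda label: min(math.dist(p, r)
                                     for r in representatives[label]))


reps = {"e": [(0, 0), (0, 5)], "h": [(10, 0), (10, 5)]}
label = assign((2, 4), reps)
```

Because each cluster contributes several sample points spread over its extent, a point can be placed in an elongated or curved cluster even when that cluster's centroid is far away.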
