
(Data Mining)

Email: junming.shao@gmail.com
Tel:18280096713

Data Mining Lab, Big Data Research Center


School of Computer Science and Engineering, UESTC
Http://staff.uestc.edu.cn/shaojunming
(Cluster Analysis)
Clustering

(Clusters)

1.


WEB
2.


3. /




(a structure of natural grouping)


high intra-class similarity
low inter-class similarity

partitioning method
hierarchical method
density-based method
grid-based method
1 (partitioning method)

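Given a data set of n objects, a partitioning method constructs k partitions of the data (k <= n), where each partition represents a cluster.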


Representative algorithms: K-Means, K-Medoids
1.1 K-Means

DK

E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \bar{x}_i \rVert^2

where \bar{x}_i is the mean of cluster C_i.
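Algorithm: K-MEANS
Input: the number of clusters k and a data set of n objects
Output: k clusters
(1) assign initial values for means; /* choose k objects as the initial cluster centers */
(2) REPEAT
(3)   FOR j = 1 to n DO assign each x_j to the closest cluster;
(4)   FOR i = 1 to k DO \bar{x}_i = \frac{1}{|C_i|} \sum_{x \in C_i} x; /* update the cluster means */
(5)   Compute E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \bar{x}_i \rVert^2; /* compute the criterion function E */
(6) UNTIL E no longer changes noticeably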
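The K-MEANS steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function names, the optional `init` argument, the tolerance `tol` and the iteration cap `max_iter` are assumptions added for the example, and points are plain tuples of numbers.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kmeans(points, k, init=None, max_iter=100, tol=1e-9, seed=0):
    """Return (means, clusters, E) following steps (1) to (6) of K-MEANS."""
    # (1) initial means: either supplied or k randomly chosen objects
    means = list(init) if init is not None else random.Random(seed).sample(points, k)
    previous_e = None
    for _ in range(max_iter):
        # (3) assign each object to the cluster with the closest mean
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: squared_distance(x, means[i]))
            clusters[nearest].append(x)
        # (4) recompute the mean of every non-empty cluster
        for i, members in enumerate(clusters):
            if members:
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
        # (5) criterion function E: sum of squared distances to the means
        e = sum(squared_distance(x, means[i])
                for i, members in enumerate(clusters) for x in members)
        # (6) stop once E no longer changes
        if previous_e is not None and abs(previous_e - e) < tol:
            break
        previous_e = e
    return means, clusters, e
```

Calling `kmeans(points, 2)` on a list of 2-D tuples returns the two cluster means, the member lists and the final value of E. Passing `init` makes the run reproducible, which is useful because the result of K-MEANS depends on the initial means.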
K-MEANS
K=2

[Figure: six scatter-plot panels illustrating the K-MEANS iterations; both axes range from 0 to 10]

1. K-MEANS

2. K-MEANS

3. K-MEANS

kparameter-free clustering algorithms




DBSCAN
K-
1.2 K-Medoids (K-MEDOIDS)

PAM (Partitioning Around Medoids).

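E = \sum_{i=1}^{k} \sum_{p \in C_i} dist(p, o_i)

where E is the sum of the absolute errors of all objects, p is an object in cluster C_i, and o_i is the medoid (representative object) of C_i.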


1()

3

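Let O_i be a current medoid and O_h a non-medoid object. The total cost of replacing O_i with O_h is

TC_ih = \sum_{j} C_jih

where C_jih is the cost contributed by the non-medoid object O_j when O_i is replaced by O_h.

If TC_ih is negative, O_i can be replaced by O_h; otherwise O_i is kept.

Four cases for O_j:

Case 1: O_j currently belongs to O_i; after the swap O_j is closest to another medoid O_m.
C_jih = d(j, m) - d(j, i)

Case 2: O_j currently belongs to O_i; after the swap O_j is closest to O_h.
C_jih = d(j, h) - d(j, i)

Case 3: O_j currently belongs to another medoid O_m and stays with O_m after the swap.
C_jih = 0

Case 4: O_j currently belongs to another medoid O_m; after the swap O_j is closest to O_h.
C_jih = d(j, h) - d(j, m)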
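Algorithm: PAM (k-medoids)
Input: the number of clusters k and a data set of n objects
Output: k clusters
(1) arbitrarily choose k objects as the initial medoids;
(2) REPEAT
(3)   assign each remaining object to the cluster of its nearest medoid;
(4)   REPEAT
(5)     select a medoid O_i that has not been selected yet;
(6)     REPEAT
(7)       select a non-medoid object O_h that has not been selected yet;
(8)       compute the total cost TC_ih of replacing O_i with O_h and record it in S;
(9)     UNTIL all non-medoid objects have been selected;
(10)  UNTIL all medoids have been selected;
(11)  IF some total cost TC in S is less than 0 THEN perform the swap with the smallest TC, forming a new set of k medoids;
(12) UNTIL no total cost TC in S is less than 0.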

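PAM example, k = 2

Distances between the five objects A, B, C, D, E:

     A  B  C  D  E
A    0  1  2  2  3
B    1  0  2  4  3
C    2  2  0  1  5
D    2  4  1  0  3
E    3  3  5  3  0

Step 1: choose {A, B} as the initial medoids. Assigning the remaining objects to the nearest medoid gives the two clusters {A, C, D} and {B, E}.

Step 2: swap medoids. Consider replacing A or B with each of the non-medoid objects {C, D, E}.

[Figure: the five objects A to E drawn with their pairwise distances]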

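PAM example (continued)
The swap costs to compute are TC_AC, TC_AD, TC_AE, TC_BC, TC_BD and TC_BE, where A and B are the current medoids.
Taking TC_AC as an example:
A: after A is replaced by C, A is no longer a medoid. A is closest to B, so A joins the cluster of B: C_AAC = d(A,B) - d(A,A) = 1
B: B is a medoid and is not affected by the swap: C_BAC = 0
C: C belonged to the cluster of A; after the swap C is itself a medoid: C_CAC = d(C,C) - d(C,A) = 0 - 2 = -2
D: D belonged to the cluster of A; after the swap its nearest medoid is C: C_DAC = d(D,C) - d(D,A) = 1 - 2 = -1
E: E belonged to the cluster of B; after the swap its nearest medoid is still B: C_EAC = 0
TC_AC = C_AAC + C_BAC + C_CAC + C_DAC + C_EAC = 1 + 0 - 2 - 1 + 0 = -2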
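The swap cost from the PAM example can be checked with a short Python sketch. It computes the total cost of a swap as the change in the criterion E (sum of distances to the nearest medoid), which gives the same value as summing the per-object costs. The function names and the dictionary layout of the distance matrix are assumptions made for this illustration.

```python
def total_cost(dist, medoids):
    """E: sum over all objects of the distance to the nearest medoid."""
    return sum(min(dist[o][m] for m in medoids) for o in dist)


def swap_cost(dist, medoids, o_i, o_h):
    """Total cost TC of replacing medoid o_i with non-medoid o_h."""
    swapped = [o_h if m == o_i else m for m in medoids]
    return total_cost(dist, swapped) - total_cost(dist, medoids)


def pam(dist, medoids):
    """Repeat the cheapest swap until no swap has a negative total cost."""
    medoids = list(medoids)
    while True:
        candidates = [(swap_cost(dist, medoids, o_i, o_h), o_i, o_h)
                      for o_i in medoids
                      for o_h in dist if o_h not in medoids]
        cost, o_i, o_h = min(candidates, key=lambda c: c[0])
        if cost >= 0:
            return medoids
        medoids[medoids.index(o_i)] = o_h


# Distance matrix of the example with objects A to E
names = "ABCDE"
rows = [[0, 1, 2, 2, 3],
        [1, 0, 2, 4, 3],
        [2, 2, 0, 1, 5],
        [2, 4, 1, 0, 3],
        [3, 3, 5, 3, 0]]
dist = {a: dict(zip(names, row)) for a, row in zip(names, rows)}

print(swap_cost(dist, ["A", "B"], "A", "C"))  # prints -2, matching TC_AC
```

Recomputing E for every candidate swap is wasteful for large n; it is used here only because it keeps the sketch short.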

K-MEANS

O(L * k * (n - k)^2)
2 (hierarchical method)







AGNES
DIANA

k

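[Figure: agglomerative (AGNES) versus divisive (DIANA) hierarchical clustering of the objects a, b, c, d, e. AGNES, steps 0 to 4: the single objects merge into {a, b} and {d, e}, then {c, d, e}, and finally {a, b, c, d, e}. DIANA runs through the same hierarchy in the opposite direction, steps 4 to 0.]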

2.1 AGNES

AGNES (AGglomerative NESting)

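Algorithm: AGNES
Input: a data set of n objects and the termination condition, the number of clusters k
Output: k clusters
(1) treat each object as an initial cluster;
(2) REPEAT
(3)   find the two closest clusters;
(4)   merge the two clusters to form a new set of clusters;
(5) UNTIL the number of clusters equals k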
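The AGNES loop above can be sketched in Python as follows. The sketch assumes single-link distance (the distance between the two closest members) as the cluster distance and Euclidean distance between points; both are choices made for this illustration, and the function name is hypothetical.

```python
import math


def agnes(points, k):
    """Merge the two closest clusters until only k clusters remain.

    Clusters are lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]  # (1) one cluster per object

    def single_link(a, b):
        # assumed cluster distance: closest pair of members
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:  # (5) stop at k clusters
        best = None
        # (3) find the two closest clusters
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # (4) merge them
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

With a different cluster distance, for example the average distance between members, only `single_link` needs to change.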


2.2 DIANA

DIANA (Divisive ANAlysis)

d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{x \in C_i} \sum_{y \in C_j} \lVert x - y \rVert
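Algorithm: DIANA
Input: a data set of n objects and the termination condition, the number of clusters k
Output: k clusters
(1) treat all objects as one initial cluster;
(2) FOR (i = 1; i ≠ k; i++) DO BEGIN
(3)   select the cluster C with the largest diameter among all clusters;
(4)   find the point p in C with the largest average dissimilarity to the other points; put p into the splinter group and the remaining points into the old party;
(5)   REPEAT
(6)     in the old party, find a point whose distance to the nearest point of the splinter group is not greater than its distance to the nearest point of the old party, and move it to the splinter group;
(7)   UNTIL no point of the old party can be moved to the splinter group;
(8)   the splinter group and the old party are the two clusters into which the selected cluster is split; together with the other clusters they form the new set of clusters;
(9) END.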
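DIANA example: eight two-dimensional points, k = 2

No.  Attribute 1  Attribute 2
1    1            1
2    1            2
3    2            1
4    2            2
5    3            4
6    3            5
7    4            4
8    4            5

Step 1: compute the average dissimilarity (Euclidean distance) of each point to all other points.
Point 1: (1 + 1 + 1.414 + 3.6 + 4.24 + 4.47 + 5) / 7 = 2.96
Point 2: 2.526; point 3: 2.68; point 4: 2.18; point 5: 2.18; point 6: 2.68; point 7: 2.526; point 8: 2.96
Point 1 is moved to the splinter group; the remaining points form the old party.
Step 2: in the old party, find a point whose distance to the nearest point of the splinter group is not greater than its distance to the nearest point of the old party. Point 2 is moved to the splinter group.
Step 3: repeat step 2; point 3 is moved to the splinter group.
Step 4: repeat step 2; point 4 is moved to the splinter group.
Step 5: no point of the old party can be moved to the splinter group. The number of clusters is now k = 2, so the algorithm terminates.

Step  Cluster being split   Splinter group  Old party
1     {1,2,3,4,5,6,7,8}     {1}             {2,3,4,5,6,7,8}
2     {1,2,3,4,5,6,7,8}     {1,2}           {3,4,5,6,7,8}
3     {1,2,3,4,5,6,7,8}     {1,2,3}         {4,5,6,7,8}
4     {1,2,3,4,5,6,7,8}     {1,2,3,4}       {5,6,7,8}
5     {1,2,3,4,5,6,7,8}     {1,2,3,4}       {5,6,7,8}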
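One split of DIANA, as traced in the example, can be sketched in Python. The sketch assumes Euclidean distance and implements a single split of one cluster into a splinter group and an old party; choosing the cluster with the largest diameter and looping until k clusters exist is left out. The function name is an assumption, and points are referred to by 0-based index.

```python
import math


def diana_split(points):
    """Split one cluster into (splinter group, old party), as index lists."""
    idx = list(range(len(points)))

    def d(i, j):
        return math.dist(points[i], points[j])

    # seed: the point with the largest average dissimilarity to the others
    seed = max(idx, key=lambda i: sum(d(i, j) for j in idx if j != i) / (len(idx) - 1))
    splinter = [seed]
    old_party = [i for i in idx if i != seed]

    moved = True
    while moved and len(old_party) > 1:
        moved = False
        for p in old_party:
            to_splinter = min(d(p, s) for s in splinter)
            to_old = min(d(p, o) for o in old_party if o != p)
            # move p if it is at least as close to the splinter group
            if to_splinter <= to_old:
                old_party.remove(p)
                splinter.append(p)
                moved = True
                break  # restart the scan with the updated groups
    return sorted(splinter), sorted(old_party)


points = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 4), (3, 5), (4, 4), (4, 5)]
print(diana_split(points))
```

Points 1 and 8 of the example have the same average dissimilarity, so which of them seeds the splinter group can depend on floating-point rounding; in either case the data set is divided into the groups {1, 2, 3, 4} and {5, 6, 7, 8}.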
3 (density-based method)








I/O
DBSCAN, OPTICS, DENCLUE
3.1 DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)


DBSCAN






DBSCAN

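DBSCAN definitions
Definition 1 (Eps-neighborhood): the region within radius Eps of a given object.
Definition 2 (core object): an object whose Eps-neighborhood contains at least MinPts objects.

Example: with Eps = 1 cm and MinPts = 5, q is a core object.

Definition 3 (directly density-reachable): given an object set D, if p lies in the Eps-neighborhood of q and q is a core object, then p is directly density-reachable from q.

Example: with Eps = 1 cm and MinPts = 5, q is a core object and p is directly density-reachable from q.

Definition 4 (density-reachable): if there is a chain of objects p1, p2, ..., pn with p1 = q, pn = p, pi in D (1 <= i <= n), and each p(i+1) is directly density-reachable from pi with respect to Eps and MinPts, then p is density-reachable from q with respect to Eps and MinPts.

Example: with Eps = 1 cm and MinPts = 5, q is a core object. p1 is directly density-reachable from q and p is directly density-reachable from p1, both with respect to Eps and MinPts, so p is density-reachable from q with respect to Eps and MinPts.

Definition 5 (density-connected): if D contains an object o such that both p and q are density-reachable from o with respect to Eps and MinPts, then p and q are density-connected with respect to Eps and MinPts.

Definition 6 (noise): a density-based cluster is a maximal set of density-connected objects; objects that belong to no cluster are noise.

DBSCAN checks the Eps-neighborhood of every point in the data set. If the Eps-neighborhood of a point p contains at least MinPts points, a new cluster is created with p as a core object. DBSCAN then repeatedly collects the objects that are directly density-reachable from these core objects, until no new point can be added to any cluster.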



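Algorithm 5-5: DBSCAN
Input: a data set of n objects, the radius Eps and the minimum number of points MinPts
Output: all clusters that satisfy the density requirement
(1) REPEAT
(2)   take an unprocessed point from the data set;
(3)   IF the point is a core point THEN find all objects density-reachable from it and form a cluster;
(4)   ELSE the point is a border point (not a core object); skip it and continue with the next point;
(5) UNTIL all points have been processed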
DBSCAN example: n = 12, Eps = 1, MinPts = 4

Sample data:

No.  Attribute 1  Attribute 2
1    1            0
2    4            0
3    0            1
4    1            1
5    2            1
6    3            1
7    4            1
8    5            1
9    0            2
10   1            2
11   4            2
12   1            3

Execution of DBSCAN:

Step  Selected point  Points in its Eps-neighborhood  Cluster
1     1               2                               none
2     2               2                               none
3     3               3                               none
4     4               5                               C1 = {1, 3, 4, 5, 9, 10, 12}
5     5               3                               already in C1
6     6               3                               none
7     7               5                               C2 = {2, 6, 7, 8, 11}
8     8               2                               already in C2
9     9               3                               already in C1
10    10              4                               already in C1
11    11              2                               already in C2
12    12              2                               already in C1

Result: two clusters, {1, 3, 4, 5, 9, 10, 12} and {2, 6, 7, 8, 11}.
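The example with n = 12, Eps = 1 and MinPts = 4 can be reproduced with a small Python sketch of DBSCAN. It assumes Euclidean distance and counts a point as a member of its own Eps-neighborhood, which matches the neighborhood sizes of the example. The function names and the label convention (cluster numbers from 1, -1 for noise) are assumptions made for this illustration.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 1, 2, ... for clusters, NOISE otherwise."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already processed
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


# The 12 sample points of the example, in order (point 1 has index 0)
data = [(1, 0), (4, 0), (0, 1), (1, 1), (2, 1), (3, 1),
        (4, 1), (5, 1), (0, 2), (1, 2), (4, 2), (1, 3)]
print(dbscan(data, eps=1, min_pts=4))
# prints [1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1]
```

Label 1 corresponds to C1 = {1, 3, 4, 5, 9, 10, 12} and label 2 to C2 = {2, 6, 7, 8, 11}. Each neighborhood query scans all points, so this sketch takes O(n^2) time; a spatial index would be needed for large data sets.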

DBSCAN

Eps
MinPts
I/O

4 (grid-based method)






STING

STING
STING (Statistical Information Grid-based method)




Statistical parameters stored for each cell: count, m (mean), s (standard deviation), min (minimum), max (maximum)
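As an illustration of the per-cell statistics used by STING, the following Python sketch computes count, mean, standard deviation, minimum and maximum for one grid level. It assumes 2-D points that each carry a single numeric attribute and square cells of a fixed size; the function name, the tuple layout `(x, y, value)` and the dictionary keys are hypothetical, and the hierarchy of grid levels is not built.

```python
import math
from collections import defaultdict


def cell_statistics(points, cell_size):
    """Map each grid cell (column, row) to its summary statistics.

    `points` is a list of (x, y, value) tuples.
    """
    cells = defaultdict(list)
    for x, y, value in points:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append(value)

    stats = {}
    for key, values in cells.items():
        mean = sum(values) / len(values)
        # population standard deviation of the attribute within the cell
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        stats[key] = {"count": len(values), "m": mean, "s": std,
                      "min": min(values), "max": max(values)}
    return stats
```

The statistics of a higher-level cell can be derived from those of its child cells, so the raw data has to be scanned only once.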
STING



STING

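The time complexity of generating the clusters is O(n), where n is the number of objects. Once the grid hierarchy has been built, a query is processed in O(g), where g is the number of grid cells at the lowest level; g is usually much smaller than n, which makes STING efficient.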

CLIQUE


(1)


(, ),
(discordancy test)


: ,


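(1)
Working hypothesis H: the entire data set of n objects comes from an initial distribution model F, that is,
H: O_i ∈ F, where i = 1, 2, ..., n
A discordancy test checks whether an object O_i is significantly large (or small) relative to the distribution F.

Assume that a statistic T has been chosen for the test and that its value for object O_i is V_i. The distribution of T is constructed and the significance probability
SP(V_i) = Prob(T > V_i)
is evaluated. If SP(V_i) is sufficiently small, O_i is discordant and the working hypothesis is rejected. The alternative hypothesis, that O_i comes from another distribution model G, is adopted instead.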
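A minimal Python sketch of the significance probability SP(V) = Prob(T > V) follows. It assumes that the distribution F is a normal distribution with known mean and standard deviation and that the statistic T is the standardized value; neither assumption comes from the slides. The threshold `alpha` and the function names are also hypothetical.

```python
import math


def significance_probability(v, mean, std):
    """SP(v) = Prob(T > v) for a normal distribution with the given parameters."""
    z = (v - mean) / std
    # upper-tail probability of the standard normal distribution
    return 0.5 * math.erfc(z / math.sqrt(2))


def discordant(values, mean, std, alpha=0.01):
    """Values whose significance probability falls below alpha."""
    return [v for v in values if significance_probability(v, mean, std) < alpha]
```

This tests only for values that are unusually large; a test for unusually small values would use the lower tail, Prob(T < V), in the same way.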
(1)


(2)

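Definition: an object o in a data set T is a distance-based outlier with parameters p and d, written DB(p, d), if at least a fraction p of the objects in T lie at a distance greater than d from o.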

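(2)

Index-based algorithm: given a data set, the algorithm uses a multidimensional index structure such as an R-tree or a k-d tree to search for the neighbors of each object o within radius d.
Let M be the maximum number of objects allowed in the d-neighborhood of an outlier.
As soon as M + 1 neighbors of o are found, o is not an outlier.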
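The DB(p, d) test with the early stop at M + 1 neighbors can be sketched in Python with a plain scan over all objects instead of an index structure. The sketch assumes Euclidean distance, sets M to the integer part of n(1 - p), and does not count an object as its own neighbor; these choices and the function name are assumptions made for the illustration.

```python
import math


def db_outliers(points, p, d):
    """Indices of the DB(p, d) outliers in `points`."""
    n = len(points)
    max_neighbors = int(n * (1 - p))  # M: neighbors an outlier may have at most
    outliers = []
    for i, o in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i != j and math.dist(o, q) <= d:
                neighbors += 1
                if neighbors > max_neighbors:
                    break  # M + 1 neighbors found: o is not an outlier
        if neighbors <= max_neighbors:
            outliers.append(i)
    return outliers
```

For example, with five points clustered near the origin and one point at (10, 10), calling `db_outliers(points, 0.8, 2)` reports only the index of the distant point. Replacing the inner scan with a range query on an R-tree or k-d tree gives the index-based variant.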





(2)

(cell-based)



(2)

Md-
M,
,

M

, o, o.
d-M
(3)






(3)

(smoothing factor)

References
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98.
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.
Take Home Message

1. 4
?

2. Kmeans

3. DBSCAN

4.
