Chapter 6

(Data Mining)
Email: junming.shao@gmail.com
Tel:18280096713
Data Mining Lab, Big Data Research Center

School of Computer Science and Engineering, UESTC
Http://staff.uestc.edu.cn/shaojunming
(Cluster Analysis)
Clustering
(Clusters)
1.

WEB
2.

3. /

(a structure of natural grouping)

high intra-class similarity
low inter-class similarity
partitioning method
hierarchical method
density-based method
grid-based method
1 (partitioning method)
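Given a data set of n objects, a partitioning method constructs k partitions of the data (k <= n), each partition representing one cluster.

Representative algorithms: K-Means, K-Medoids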
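1.1 K-Means
Squared-error criterion for a data set D partitioned into K clusters:
$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \bar{x}_i \rVert^2$
where $\bar{x}_i$ is the mean of cluster $C_i$, i = 1, ..., k.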
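Algorithm: K-MEANS
Input: the number of clusters k and a data set of n objects
Output: k clusters
(1) assign initial values for means; /* choose k objects as the initial cluster centers */
(2) REPEAT
(3) FOR j=1 to n DO assign each x_j to the closest cluster;
(4) FOR i=1 to k DO $\bar{x}_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$; /* update the cluster means */
(5) Compute $E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \bar{x}_i \rVert^2$; /* compute the squared-error criterion E */
(6) UNTIL E no longer changes noticeably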
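The loop above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function name `k_means`, the parameters `max_iter`, `tol` and `seed`, the random choice of initial means, and the sample points are all assumptions made for the example.

```python
import math
import random


def k_means(points, k, max_iter=100, tol=1e-9, seed=0):
    """Cluster `points` (tuples of floats) into k clusters; returns (means, labels, E)."""
    rng = random.Random(seed)
    # Step (1): pick k distinct objects as the initial means (assumed strategy).
    means = rng.sample(points, k)
    prev_error = math.inf
    for _ in range(max_iter):
        # Step (3): assign each x_j to the closest mean.
        labels = [min(range(k), key=lambda i: math.dist(x, means[i])) for x in points]
        # Step (4): recompute each mean as the centroid of its cluster.
        for i in range(k):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:  # keep the old mean if a cluster became empty
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
        # Step (5): squared-error criterion E.
        error = sum(math.dist(x, means[lab]) ** 2 for x, lab in zip(points, labels))
        # Step (6): stop once E no longer changes noticeably.
        if abs(prev_error - error) <= tol:
            break
        prev_error = error
    return means, labels, error


# Hypothetical sample data: two well-separated groups of three points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
means, labels, error = k_means(data, k=2)
print(labels, round(error, 4))
```

On data this well separated the result does not depend on the initial means; in general K-MEANS only reaches a local minimum of E, so different seeds can give different clusterings.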
K-MEANS example, K=2
[Figure: six scatter-plot panels, both axes running from 0 to 10]
1. K-MEANS
2. K-MEANS
3. K-MEANS

k must be specified in advance (cf. parameter-free clustering algorithms)

DBSCAN
K-
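1.2 K-Medoids (K-MEDOIDS)
Representative algorithm: PAM (Partitioning Around Medoids).
$E = \sum_{i=1}^{k} \sum_{p \in C_i} dist(p, o_i)$
where E is the sum of the absolute errors, p is an object in cluster $C_i$, and $o_i$ is the medoid (representative object) of $C_i$.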

1()
3

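To decide whether a non-medoid object $O_h$ is a good replacement for a current medoid $O_i$, compute the total swap cost
$TC_{ih} = \sum_j C_{jih}$
where $C_{jih}$ is the cost contributed by object $O_j$ when $O_i$ is replaced by $O_h$.

Four cases for an object $O_j$ ($O_m$ denotes another medoid, m != i):

Case 1: $O_j$ belongs to $O_i$; after the swap it is closest to $O_m$ and is reassigned to $O_m$.
$C_{jih} = d(j, m) - d(j, i)$
Case 2: $O_j$ belongs to $O_i$; after the swap it is closest to $O_h$ and is reassigned to $O_h$.
$C_{jih} = d(j, h) - d(j, i)$
Case 3: $O_j$ belongs to $O_m$ and is still closest to $O_m$ after the swap; nothing changes.
$C_{jih} = 0$
Case 4: $O_j$ belongs to $O_m$; after the swap it is closest to $O_h$ and is reassigned to $O_h$.
$C_{jih} = d(j, h) - d(j, m)$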
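Algorithm: PAM (k-medoids)
Input: the number of clusters k and a database of n objects
Output: k clusters
1 Arbitrarily choose k objects as the initial medoids;
2 REPEAT
3 Assign each remaining object to the cluster of its nearest medoid;
4 REPEAT
5 Select a medoid Oi that has not been selected yet;
6 REPEAT
7 Select a non-medoid object Oh that has not been selected yet;
8 Compute the total cost TC of replacing Oi by Oh and record it in S;
9 UNTIL all non-medoid objects have been selected;
10 UNTIL all medoids have been selected;
11 IF some TC in S is less than 0 THEN apply the swap with the smallest TC, giving a new set of k medoids;
12 UNTIL no TC is less than 0.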
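A compact Python sketch of this swap loop is given below, under stated assumptions: the input is a full n x n dissimilarity matrix, the first k objects serve as the arbitrary initial medoids, and the names `total_cost` and `pam` as well as the one-dimensional sample values are invented for the illustration. The total cost TC of a swap is computed as the change in the sum of distances to the nearest medoid.

```python
def total_cost(dist, medoids):
    """Sum over all objects of the distance to the nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)


def pam(dist, k):
    """Return (medoids, clusters) for an n x n dissimilarity matrix."""
    n = len(dist)
    medoids = list(range(k))                 # step 1: arbitrary initial medoids (assumption)
    while True:
        best_tc, best_swap = 0, None
        for i in medoids:                    # steps 4-10: try every (Oi, Oh) pair
            for h in range(n):
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                tc = total_cost(dist, candidate) - total_cost(dist, medoids)
                if tc < best_tc:
                    best_tc, best_swap = tc, candidate
        if best_swap is None:                # step 12: no swap has TC < 0
            break
        medoids = best_swap                  # step 11: apply the cheapest swap
    clusters = {m: [] for m in medoids}
    for j in range(n):                       # step 3: assign to the nearest medoid
        clusters[min(medoids, key=lambda m: dist[j][m])].append(j)
    return medoids, clusters


# Hypothetical 1-D sample: two groups around 2 and 11.
values = [1, 2, 3, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
medoids, clusters = pam(dist, k=2)
print(sorted(medoids), clusters)
```

Because every accepted swap strictly lowers the total cost, the loop terminates; each pass evaluates k(n-k) candidate swaps.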

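PAM example (k=2)
Dissimilarity matrix of the five objects A, B, C, D, E:
     A  B  C  D  E
A    0  1  2  2  3
B    1  0  2  4  3
C    2  2  0  1  5
D    2  4  1  0  3
E    3  3  5  3  0
Step 1: the 5 objects are to be grouped into 2 clusters. Choose {A, B} as the initial medoids; assigning the other objects to the nearest medoid gives the 2 clusters {A, C, D} and {B, E}.
Step 2: try to replace each medoid A, B by each non-medoid object in {C, D, E}.
[Figure: the five objects A to E drawn as a graph with the distances above, and the two initial clusters]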
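PAM example (continued)
Six swap costs have to be computed: $TC_{AC}$, $TC_{AD}$, $TC_{AE}$, $TC_{BC}$, $TC_{BD}$, $TC_{BE}$, one for each replacement of a medoid in {A, B} by a non-medoid.
Taking $TC_{AC}$ (A replaced by C) as the example:
(a) A is no longer a medoid. A is closer to B than to C, so A is assigned to the cluster of B: $C_{AAC} = d(A,B) - d(A,A) = 1$
(b) B is a medoid and is not affected by the swap: $C_{BAC} = 0$
(c) C belonged to the cluster of A and becomes the new medoid (PAM cost case "reassigned from $O_i$ to $O_h$"): $C_{CAC} = d(C,C) - d(C,A) = 0 - 2 = -2$
(d) D belonged to the cluster of A; after the swap its nearest medoid is C (same PAM cost case): $C_{DAC} = d(D,C) - d(D,A) = 1 - 2 = -1$
(e) E belonged to the cluster of B; after the swap its nearest medoid is still B (PAM cost case "unchanged"): $C_{EAC} = 0$
$TC_{AC} = C_{AAC} + C_{BAC} + C_{CAC} + C_{DAC} + C_{EAC} = 1 + 0 - 2 - 1 + 0 = -2$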
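The per-object costs can be checked mechanically. The sketch below uses the dissimilarity matrix of the example; the helper names (`nearest_distance`, `swap_costs`, `total_swap_cost`) are illustrative. Each cost is computed as "distance to the nearest medoid after the swap minus distance to the nearest medoid before the swap", which covers the four PAM cost cases in one expression.

```python
# Dissimilarity matrix from the example (objects A..E).
NAMES = "ABCDE"
D = [
    [0, 1, 2, 2, 3],
    [1, 0, 2, 4, 3],
    [2, 2, 0, 1, 5],
    [2, 4, 1, 0, 3],
    [3, 3, 5, 3, 0],
]


def nearest_distance(j, medoids):
    """Distance from object j to its closest medoid."""
    return min(D[j][m] for m in medoids)


def swap_costs(medoids, i, h):
    """Per-object costs C_jih when medoid i is replaced by non-medoid h."""
    new_medoids = [h if m == i else m for m in medoids]
    return {
        NAMES[j]: nearest_distance(j, new_medoids) - nearest_distance(j, medoids)
        for j in range(len(D))
    }


def total_swap_cost(medoids, i, h):
    """TC_ih: sum of C_jih over all objects j."""
    return sum(swap_costs(medoids, i, h).values())


medoids = [0, 1]                            # A and B
costs_ac = swap_costs(medoids, i=0, h=2)    # replace A by C
tc = {
    NAMES[i] + NAMES[h]: total_swap_cost(medoids, i, h)
    for i in medoids
    for h in range(len(D))
    if h not in medoids
}
print(costs_ac)
print(tc)
```

All six swaps have a negative total cost here, so the initial medoids {A, B} are replaced in the next PAM iteration.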

K-MEANS
O(L * k * (n-k)^2)
2 (hierarchical method)

AGNES
DIANA

k
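[Figure: hierarchical clustering of the objects a, b, c, d, e. Agglomerative (AGNES), Step 0 to Step 4: the single objects are merged into ab, de, cde and finally abcde. Divisive (DIANA) runs the same steps in reverse, from abcde at its Step 0 down to the single objects at its Step 4.]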

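2.1 AGNES
AGNES (AGglomerative NESting)
Algorithm: AGNES (bottom-up agglomerative)
Input: a database of n objects and the number of clusters k as the termination condition
Output: k clusters
(1) Treat each object as an initial cluster;
(2) REPEAT
(3) Find the two closest clusters, measured by the closest pair of data points in the two clusters;
(4) Merge the two clusters, giving a new set of clusters;
(5) UNTIL the required number of clusters is reached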
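The following Python sketch mirrors these steps. It assumes Euclidean distance and 1 <= k <= n, and it takes the distance between two clusters to be the distance between their two closest points (single link); the function names and the sample points are illustrative.

```python
import math
from itertools import combinations


def single_link(c1, c2):
    """Cluster distance: distance between the two closest points."""
    return min(math.dist(p, q) for p in c1 for q in c2)


def agnes(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]                      # step (1)
    while len(clusters) > k:                              # steps (2) and (5)
        a, b = min(                                       # step (3)
            combinations(range(len(clusters)), 2),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]           # step (4)
        del clusters[b]                                   # a < b, so index a stays valid
    return clusters


# Hypothetical sample: two groups of four grid points.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 4), (3, 5), (4, 4), (4, 5)]
clusters = agnes(points, k=2)
print(clusters)
```

This naive version recomputes all pairwise cluster distances on every merge, which is fine for a handful of points but not for large data sets.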

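2.2 DIANA
DIANA (Divisive ANAlysis)
Average dissimilarity (average distance) between two clusters:
$d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{x \in C_i} \sum_{y \in C_j} |x - y|$
Algorithm: DIANA (top-down divisive)
Input: a database of n objects and the number of clusters k as the termination condition
Output: k clusters
1 Treat all objects as one initial cluster;
2 FOR (i=1; i != k; i++) DO BEGIN
3 Pick the cluster C with the largest diameter;
4 Find the point p in C with the largest average dissimilarity to the other points; put p into the splinter
group and the remaining points into the old party;
5. REPEAT
6 In the old party, find a point whose distance to the nearest point of the splinter group is not greater than
its distance to the nearest point of the old party, and move it to the splinter group;
7 UNTIL no point of the old party is moved to the splinter group;
8 The splinter group and the old party are the two clusters into which C is split; together with the other clusters they form the new set of clusters;

9 END.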
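DIANA example (k=2)
Data set of 8 objects:
No.  Attribute 1  Attribute 2
1    1            1
2    1            2
3    2            1
4    2            2
5    3            4
6    3            5
7    4            4
8    4            5
Step 1: find the cluster with the largest diameter and compute the average dissimilarity of every point in it (Euclidean distance).
Point 1: (1 + 1 + 1.414 + 3.6 + 4.24 + 4.47 + 5) / 7 = 2.96
Likewise: point 2: 2.528; point 3: 2.68; point 4: 2.18; point 5: 2.18; point 6: 2.68; point 7: 2.528; point 8: 2.96.
Point 1 has the largest average dissimilarity (tied with point 8) and is put into the splinter group; the remaining points form the old party.
Step 2: in the old party, find a point whose distance to the nearest point of the splinter group is not greater than its distance to the nearest point of the old party, and move it to the splinter group; this point is 2.
Step 3: repeat step 2; point 3 is moved to the splinter group.
Step 4: repeat step 2; point 4 is moved to the splinter group.
Step 5: no point of the old party is moved to the splinter group, and the termination condition (k=2) is reached, so the algorithm stops.

Step  Cluster being split   splinter group   old party
1     {1,2,3,4,5,6,7,8}     {1}              {2,3,4,5,6,7,8}
2     {1,2,3,4,5,6,7,8}     {1,2}            {3,4,5,6,7,8}
3     {1,2,3,4,5,6,7,8}     {1,2,3}          {4,5,6,7,8}
4     {1,2,3,4,5,6,7,8}     {1,2,3,4}        {5,6,7,8}
5     {1,2,3,4,5,6,7,8}     {1,2,3,4}        {5,6,7,8}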
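The split in this example can be reproduced with a short Python sketch. It follows the rule used on these slides (a point moves when its nearest splinter-group point is no farther than its nearest old-party point), which is a simplification of the average-dissimilarity rule in Kaufman and Rousseeuw's original DIANA. The helper names and the tie-breaking by smallest object number are assumptions.

```python
import math

# The eight objects of the example: id -> (attribute 1, attribute 2).
POINTS = {1: (1, 1), 2: (1, 2), 3: (2, 1), 4: (2, 2),
          5: (3, 4), 6: (3, 5), 7: (4, 4), 8: (4, 5)}


def dist(a, b):
    """Euclidean distance between objects a and b."""
    return math.dist(POINTS[a], POINTS[b])


def avg_dissimilarity(p, cluster):
    """Mean distance from p to the other members of the cluster."""
    others = [q for q in cluster if q != p]
    return math.fsum(dist(p, q) for q in others) / len(others)


def split(cluster):
    """One DIANA split of `cluster` into (splinter group, old party)."""
    old = sorted(cluster)
    # Seed: largest average dissimilarity; ties go to the smallest id (assumption).
    seed = max(old, key=lambda p: (round(avg_dissimilarity(p, cluster), 9), -p))
    splinter = [seed]
    old.remove(seed)
    moved = True
    while moved and len(old) > 1:
        moved = False
        for p in old:
            to_splinter = min(dist(p, q) for q in splinter)
            to_old = min(dist(p, q) for q in old if q != p)
            if to_splinter <= to_old:
                splinter.append(p)
                old.remove(p)
                moved = True
                break  # rescan the old party after every move
    return sorted(splinter), old


averages = {p: round(avg_dissimilarity(p, list(POINTS)), 3) for p in POINTS}
splinter, old = split(list(POINTS))
print(averages)
print(splinter, old)
```

With k=2 a single split suffices; for larger k the same `split` would be applied again to the cluster with the largest diameter.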
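1.3 (density-based method)

Frequent I/O operations can become a cost on large data sets. Representative algorithms: DBSCAN, OPTICS,
DENCLUE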
3.1 DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN

DBSCAN

DBSCAN
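Definition 1 (ε-neighborhood, Eps): the region within radius ε of a given object.
Definition 2 (core object, MinPts): an object whose ε-neighborhood contains at least MinPts objects is a core object.

Example: with ε=1cm and MinPts=5, q is a core object.
Definition 3 (directly density-reachable): given a set of objects D, p is directly density-reachable from q if p lies in the ε-neighborhood of q and q is a core object.

Example: with ε=1cm and MinPts=5, q is a core object and p is directly density-reachable from q.

Definition 4 (density-reachable): p is density-reachable from q with respect to ε and MinPts if there is a chain of objects p1, p2, ..., pn
with p1=q, pn=p and pi in D (1<=i<=n) such that pi+1 is directly density-reachable from pi with respect to ε and MinPts.

Example: with ε=1cm and MinPts=5, q is a core object, p1 is directly density-reachable from q,
and p is directly density-reachable from p1,
so p is density-reachable from q with respect to ε and MinPts.

Definition 5 (density-connected): p and q are density-connected with respect to ε and MinPts if D contains an object o
such that both p and q are density-reachable from o with respect to ε and MinPts.
Definition 6 (cluster and noise): a density-based cluster is a maximal set of density-connected objects with respect to density-reachability; objects not contained in any cluster are noise.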

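DBSCAN finds clusters by checking the ε-neighborhood of each object in the data set.
If the ε-neighborhood of a point p contains at least MinPts objects, a new cluster with p as a core object is created.
DBSCAN then repeatedly collects the objects that are directly density-reachable from these core objects.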

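DBSCAN
Algorithm 5-5: DBSCAN
Input: a database of n objects, the radius ε, and the minimum number of points MinPts
Output: all generated clusters that satisfy the density requirement
1. REPEAT
2. Take an unprocessed point from the database;
3. IF the point is a core point THEN find all objects density-reachable from it and form a cluster;

4. ELSE the point is a border point (non-core object); skip it and go on to the next point;

5. UNTIL all points have been processed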
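DBSCAN example
n=12, ε=1, MinPts=4
Data set:
No.  Attribute 1  Attribute 2
1    1            0
2    4            0
3    0            1
4    1            1
5    2            1
6    3            1
7    4            1
8    5            1
9    0            2
10   1            2
11   4            2
12   1            3
Execution of DBSCAN:
Step  Selected point  Points within ε  Cluster found by computing the reachable points
1     1               2                none
2     2               2                none
3     3               3                none
4     4               5                C1: {1,3,4,5,9,10,12}
5     5               3                already in C1
6     6               3                none
7     7               5                C2: {2,6,7,8,11}
8     8               2                already in C2
9     9               3                already in C1
10    10              4                already in C1
11    11              2                already in C2
12    12              2                already in C1
Step 1: point 1 is selected; the circle of radius 1 around it contains 2 points (fewer than 4), so it is not a core point.
Step 4: point 4 is selected; the circle of radius 1 around it contains 5 points, so it is a core point. 4 points are directly reachable from it and 2 more are reachable indirectly, giving the new cluster {1,3,4,5,9,10,12}.
Step 7: point 7 is selected; the circle of radius 1 around it contains 5 points, so it is a core point, giving the new cluster {2,6,7,8,11}.
Result: the two clusters {1,3,4,5,9,10,12} and {2,6,7,8,11}.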
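The example can be replayed with the Python sketch below. It is a minimal illustration under the example's own conventions: the ε-neighborhood of a point includes the point itself, a point is a core object when that neighborhood holds at least MinPts objects, and points are visited in numerical order. The function names are assumptions, and a brute-force scan stands in for a spatial index.

```python
import math

# The twelve objects of the example: id -> (attribute 1, attribute 2).
POINTS = {1: (1, 0), 2: (4, 0), 3: (0, 1), 4: (1, 1), 5: (2, 1), 6: (3, 1),
          7: (4, 1), 8: (5, 1), 9: (0, 2), 10: (1, 2), 11: (4, 2), 12: (1, 3)}


def region_query(p, eps):
    """Ids of all objects in the eps-neighborhood of p (p itself included)."""
    return [q for q in POINTS if math.dist(POINTS[p], POINTS[q]) <= eps]


def dbscan(eps, min_pts):
    """Return (clusters, noise): a list of id sets and the set of noise ids."""
    label = {}                       # id -> cluster index
    clusters = []
    for p in POINTS:                 # visit objects in id order
        if p in label:
            continue                 # already assigned to a cluster
        if len(region_query(p, eps)) < min_pts:
            continue                 # not a core object; it may still join a cluster later
        cid = len(clusters)
        clusters.append(set())
        frontier = [p]
        while frontier:              # collect everything density-reachable from p
            q = frontier.pop()
            if q in label:
                continue
            label[q] = cid
            clusters[cid].add(q)
            q_neighbors = region_query(q, eps)
            if len(q_neighbors) >= min_pts:      # only core objects extend the cluster
                frontier.extend(n for n in q_neighbors if n not in label)
    noise = set(POINTS) - set(label)
    return clusters, noise


clusters, noise = dbscan(eps=1, min_pts=4)
print(clusters, noise)
```

A border point that is reachable from two clusters is kept by whichever cluster reaches it first, so the visiting order can matter for such points.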

DBSCAN
Eps
MinPts
I/O
1.4 (grid-based method)

STING

STING
STING (Statistical Information Grid-based method)

count (number of objects in the cell)
m (mean), s (standard deviation), min (minimum)
max (maximum)
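As an illustration of why these per-cell parameters suit a grid hierarchy, the Python sketch below derives the parameters of a higher-level cell from those of its child cells without revisiting the raw data. The formulas are the standard rules for pooling counts, means and population standard deviations; the class and function names are assumptions, and cells are assumed to be non-empty.

```python
import math
from dataclasses import dataclass


@dataclass
class CellStats:
    count: int        # number of objects in the cell
    m: float          # mean of the attribute
    s: float          # population standard deviation of the attribute
    minimum: float
    maximum: float


def leaf_stats(values):
    """Parameters of a bottom-level cell, computed directly from its values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return CellStats(n, mean, math.sqrt(var), min(values), max(values))


def merge(children):
    """Parameters of a parent cell, computed only from its children's parameters."""
    n = sum(c.count for c in children)
    mean = sum(c.m * c.count for c in children) / n
    second_moment = sum((c.s ** 2 + c.m ** 2) * c.count for c in children) / n
    var = max(second_moment - mean ** 2, 0.0)    # guard against tiny negative round-off
    return CellStats(n, mean, math.sqrt(var),
                     min(c.minimum for c in children),
                     max(c.maximum for c in children))


a = leaf_stats([1.0, 2.0, 3.0])
b = leaf_stats([10.0, 20.0])
parent = merge([a, b])
direct = leaf_stats([1.0, 2.0, 3.0, 10.0, 20.0])
print(parent)
```

The merged parameters agree with those computed directly from all five values, so a query can work on upper-level cells alone.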
STING

STING
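Time complexity: generating the clusters takes O(n); once the hierarchy is built, processing a query takes O(g), where g is the number of grid cells at the lowest level,
which is usually much smaller than n, the number of objects.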
CLIQUE

(1)

(, ),
(discordancy test)

: ,

(1)
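Working hypothesis H: the whole data set of n objects comes from an initial distribution model
F, i.e.
H: O_i ∈ F, i = 1, 2, ..., n
A discordancy test checks whether an object O_i is significantly large (or small) with respect to the distribution F.
Suppose a statistic T has been chosen for the test and its value for object O_i is
V_i; the distribution of T is then constructed and the significance probability
$SP(V_i) = Prob(T > V_i)$
is evaluated.
If $SP(V_i)$ is sufficiently small, O_i is discordant and the working hypothesis is rejected.
An alternative hypothesis is then adopted, stating that O_i comes from another distribution model G.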
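A small Python sketch of such a test is given below. Everything specific in it is an assumption made for illustration: F is taken to be a normal distribution with known mean `mu` and standard deviation `sigma`, the statistic is T = |X - mu| / sigma, and `alpha` is a hypothetical significance threshold. Under these assumptions Prob(T > V) equals erfc(V / sqrt(2)).

```python
import math


def significance_probability(x, mu, sigma):
    """SP(V) = Prob(T > V) for T = |X - mu| / sigma with X ~ Normal(mu, sigma)."""
    v = abs(x - mu) / sigma
    return math.erfc(v / math.sqrt(2))


def discordant(values, mu, sigma, alpha=0.01):
    """Objects whose significance probability falls below the threshold alpha."""
    return [x for x in values if significance_probability(x, mu, sigma) < alpha]


# Hypothetical measurements assumed to follow Normal(10, 0.5).
values = [9.8, 10.1, 10.0, 9.9, 14.0]
outliers = discordant(values, mu=10.0, sigma=0.5)
print(outliers)
```

The result depends entirely on the assumed model F, which is the main weakness of statistical discordancy tests when the true distribution is unknown.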
(1)
(,
), ,
,
;

(2)
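Distance-based outlier: an object o in a data set T is a distance-based outlier with parameters p and d,
written DB(p, d), if at least a fraction p of the objects in T
lie at a distance greater than d from o.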
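The definition translates directly into a brute-force check, sketched below in Python. The function name `db_outliers` and the sample data are illustrative; the fraction is taken over all n objects of the data set, and Euclidean distance is assumed.

```python
import math


def db_outliers(points, p, d):
    """Objects o for which at least a fraction p of the objects lie farther than d from o."""
    n = len(points)
    outliers = []
    for o in points:
        far = sum(1 for x in points if math.dist(o, x) > d)
        if far >= p * n:
            outliers.append(o)
    return outliers


# Hypothetical sample: four nearby points and one distant point.
data = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
result = db_outliers(data, p=0.75, d=3)
print(result)
```

This nested loop costs O(n^2) distance computations, which is what index-based and cell-based algorithms try to avoid.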

(2)

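Index-based algorithm: given a data set, use a multidimensional index structure such as an R-tree or a k-d tree to search for the neighbors of each object o
within radius d.
Let M be the maximum number of objects within the d-neighborhood of an outlier.
Once M+1 neighbors of an object o are found, o is not an outlier.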

(2)
(cell-based)

. ,
, ,

(2)
Md-
M,
,

M

, o, o.
d-M
(3)

,
,

(3)
(smoothing factor)
.

.

References
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high
dimensional data for data mining applications. SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the
clustering structure. SIGMOD'99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing
techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning,
2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on
dynamic systems. In Proc. VLDB'98.
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases.
SIGMOD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster
Analysis. John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets.
VLDB'98.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data
sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering
approach for very large spatial databases. VLDB'98.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial
data mining. VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for
very large databases. SIGMOD'96.
Take Home Message
1. 4
?
2. Kmeans
3. DBSCAN
4.
