Royal Statistical Society and Wiley are collaborating with JSTOR to digitize, preserve and extend access to Journal of the
Royal Statistical Society. Series C (Applied Statistics).
http://www.jstor.org
This content downloaded from 143.107.12.71 on Sat, 19 Dec 2015 14:17:36 UTC
All use subject to JSTOR Terms and Conditions
APPLIED STATISTICS
100
C
C
6o PIvcr = ,xC
KK = 0
DO 70 I = II, s
K = INDEX(I)
IF (ABS(LU(IC, IIl
PIVOT = ABS(W(K,
KK
C
c
C
C
C
   70 CONTINUE
IF (IE *EQ. 0) GOTO 10
C        SWITCH ORDER
      ISAVE = INDEX(KK)
      INDEX(KK) = INDEX(II)
      INDEX(II) = ISAVE
C        PUT IN COLUMNS OF LU ONE AT A TIME
IR
IF (INTIA IIBASE(II)
IF (II *EQ. MNGOTO90
J = II + 1
ix) 80 I = it M
K = INDEXVI)
= W(E,
LU(E, II)
80 CONTINUE
II
/ LU(ISAVE,
II)
   90 CONTINUE
EKE = IRCW
RETURN
END
Algorithm AS 136
A K-Means Clustering Algorithm
By J. A. HARTIGAN and M. A. WONG
Yale University, New Haven, Connecticut, U.S.A.
Keywords: K-MEANS CLUSTERING ALGORITHM; TRANSFER ALGORITHM
LANGUAGE
ISO Fortran
DESCRIPTION AND PURPOSE
The K-means clustering algorithm is described in detail by Hartigan (1975). An efficient version of the algorithm is presented here.
The aim of the K-means algorithm is to divide M points in N dimensions into K clusters so that the within-cluster sum of squares is minimized. It is not practical to require that the solution has minimal sum of squares against all partitions, except when M, N are small and K = 2. We seek instead "local" optima, solutions such that no movement of a point from one cluster to another will reduce the within-cluster sum of squares.
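The notion of a "local" optimum described above can be illustrated with a minimal Python sketch. It is not part of the published Fortran; the function names `within_cluster_ss` and `is_local_optimum` are hypothetical, and the brute-force check is meant only to make the definition concrete, not to be efficient.

```python
def within_cluster_ss(points, labels, k):
    """Return the within-cluster sum of squares of each of k clusters."""
    n = len(points[0])
    counts = [0] * k
    sums = [[0.0] * n for _ in range(k)]
    for p, l in zip(points, labels):
        counts[l] += 1
        for j in range(n):
            sums[l][j] += p[j]
    # Cluster centres are the averages of the points they contain.
    centres = [
        [s / counts[l] if counts[l] else 0.0 for s in sums[l]]
        for l in range(k)
    ]
    wss = [0.0] * k
    for p, l in zip(points, labels):
        wss[l] += sum((p[j] - centres[l][j]) ** 2 for j in range(n))
    return wss


def is_local_optimum(points, labels, k):
    """True if moving any single point to another cluster never lowers
    the total within-cluster sum of squares."""
    current = sum(within_cluster_ss(points, labels, k))
    for i in range(len(points)):
        home = labels[i]
        # A point that is the only member of its cluster is not moved.
        if labels.count(home) == 1:
            continue
        for l in range(k):
            if l == home:
                continue
            trial = list(labels)
            trial[i] = l
            if sum(within_cluster_ss(points, trial, k)) < current - 1e-12:
                return False
    return True


points = [[0.0], [1.0], [10.0], [11.0]]
print(is_local_optimum(points, [0, 0, 1, 1], 2))
print(is_local_optimum(points, [0, 1, 1, 1], 2))
```

In the second call the point at 1.0 can be moved to join the point at 0.0 with a large reduction in the sum of squares, so that partition is not locally optimal.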
METHOD
STRUCTURE
SUBROUTINE KMNS (A, M, N, C, K, IC1, IC2, NC, AN1, AN2, NCP, D, ITRAN, LIVE, ITER, WSS, IFAULT)
Formal parameters
A        Real array (M, N)
M        Integer
N        Integer
C        Real array (K, N)
K        Integer
IC1      Integer array (M)
IC2      Integer array (M)
NC       Integer array (K)
AN1      Real array (K)
AN2      Real array (K)
NCP      Integer array (K)     workspace
D        Real array (M)        workspace
ITRAN    Integer array (K)     workspace
LIVE     Integer array (K)     workspace
ITER     Integer               input: the maximum number of iterations allowed
WSS      Real array (K)        output: the within-cluster sum of squares of each cluster
IFAULT   Integer               output: see Fault Diagnostics below
FAULT DIAGNOSTICS
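Reading the KMNS listing, IFAULT appears to be set to 3 when K is not strictly between 1 and M, to 1 when a cluster is empty after the initial assignment, to 2 when the allowed number of iterations is exceeded, and to 0 otherwise. The Python sketch below mirrors the first two tests only; the function names are hypothetical and it is an illustration of the listing's checks, not a replacement for them.

```python
def check_arguments(m, k):
    """Mirror of the first test in the listing:
    return 3 if K <= 1 or K >= M, otherwise 0."""
    return 3 if (k <= 1 or k >= m) else 0


def check_initial_clusters(nc):
    """Mirror of the empty-cluster test after the initial assignment:
    return 1 if any cluster count is zero, otherwise 0."""
    return 1 if any(count == 0 for count in nc) else 0


print(check_arguments(m=50, k=8))
print(check_arguments(m=50, k=50))
print(check_initial_clusters([12, 0, 38]))
```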
RELATED ALGORITHMS
A related algorithm is AS 113 (A transfer algorithm for non-hierarchical classification) given by Banfield and Bassill (1977). This algorithm uses swops as well as transfers to try to overcome the problem of local optima; that is, for all pairs of points, a test is made whether exchanging the clusters to which the points belong will improve the criterion. It will be substantially more expensive than the present algorithm for large M.
The present algorithm is similar to Algorithm AS 58 (Euclidean cluster analysis) given by Sparks (1973). Both algorithms aim at finding a K-partition of the sample, with within-cluster sum of squares which cannot be reduced by moving points from one cluster to the other. However, the implementation of Algorithm AS 58 does not satisfy this condition. At the stage where each point is examined in turn to see if it should be reassigned to a different cluster, only the closest centre is used to check for possible reallocation of the given point; a cluster centre other than the closest one may have the smallest value of the quantity $\{n_l/(n_l + 1)\}\, d_l^2$, where $n_l$ is the number of points in cluster $l$ and $d_l$ is the distance from cluster $l$ to the given point. Hence, in general, Algorithm AS 58 does not provide a locally optimal solution.
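A small numerical case makes the point about the closest centre concrete. The Python sketch below is illustrative only: the function names and the two-cluster figures are made up for the example, and the candidate clusters are assumed to exclude the one the point currently belongs to.

```python
def transfer_cost(n_l, d_sq):
    """The quantity {n_l / (n_l + 1)} d_l^2 for a candidate cluster l,
    given its size n_l and the squared distance d_sq to the point."""
    return n_l / (n_l + 1.0) * d_sq


def closest_centre(sq_dists):
    """Index of the candidate cluster whose centre is nearest."""
    return min(range(len(sq_dists)), key=lambda l: sq_dists[l])


def best_receiving_cluster(counts, sq_dists):
    """Index of the candidate cluster with the smallest transfer cost."""
    return min(
        range(len(counts)),
        key=lambda l: transfer_cost(counts[l], sq_dists[l]),
    )


# Hypothetical candidates: a large cluster slightly nearer the point
# and a single-point cluster slightly farther away.
counts = [100, 1]
sq_dists = [4.0, 5.0]
print(closest_centre(sq_dists))
print(best_receiving_cluster(counts, sq_dists))
```

Here the nearest centre belongs to the large cluster, but the single-point cluster has the smaller value of the quantity, so a test restricted to the closest centre would miss the better reallocation.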
The two algorithms are tested on various generated data sets. The time consumed on the IBM 370/158 and the within-cluster sum of squares of the resulting K-partitions are given in Table 1. While comparing the entries of the table, note that AS 58 does not give locally optimal solutions and so should be expected to take less time. The WSS are different for the two algorithms because they arrive at different partitions of the sets of points. A saving of about 50 per cent in time occurs in KMNS due to using "live" sets and due to using a quick-transfer stage which reduces the number of optimal-transfer iterations by a factor of 4. Thus, KMNS compared to AS 58 is locally optimal and takes less time, especially when the number of clusters is large.
TIME AND ACCURACY
The time is approximately equal to CMNKI, where I is the number of iterations. For an IBM 370/158, C = 2.1 × 10^-5 sec. However, different data structures require quite different numbers of iterations; and a careful selection of initial cluster centres will also lead to a considerable saving in time.
Storage requirement: M(N + 3) + K(N + 7).
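The two formulas above are simple enough to evaluate directly. The Python sketch below uses hypothetical function names; the constant `c` is machine dependent and has to be supplied by the caller, and the storage figure is taken here to count array elements, which matches the array sizes in the parameter list.

```python
def estimated_time_sec(m, n, k, iterations, c):
    """Approximate running time C * M * N * K * I, where c is the
    machine-dependent constant C in seconds."""
    return c * m * n * k * iterations


def storage_requirement(m, n, k):
    """Storage M(N + 3) + K(N + 7), in array elements."""
    return m * (n + 3) + k * (n + 7)


# Example sizes matching the first case of Table 1.
print(storage_requirement(m=1000, n=10, k=10))
```

For M = 1000, N = 10 and K = 10 the storage formula gives 13170 array elements.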
TABLE 1

                                                          Time (sec)      WSS
1. M = 1000, N = 10, K = 10 (random spherical normal)
      AS 58                                                  63.86      7056.71
      KMNS                                                   36.66      7065.59
2. M = 1000, N = 10, K = 10 (two widely separated random normals)
      AS 58                                                  43.49      7779.70
      KMNS                                                   19.11      7822.01
3. M = 1000, N = 10, K = 50 (random spherical normal)
      AS 58                                                 135.71      4543.82
      KMNS                                                   76.00      4561.48
4. M = 1000, N = 10, K = 50 (two widely separated random normals)
      AS 58                                                  95.51      5131.04
      KMNS                                                   57.96      5096.23
5. M = 50, N = 2, K = 8 (two widely separated random normals)
      AS 58                                                   0.17        21.03
      KMNS                                                    0.18        21.03
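The timings in Table 1 can be turned into relative savings with a few lines of Python. This is only an arithmetic sketch: the dictionary name and helper function are hypothetical, and the figures are the Table 1 times read as decimal seconds.

```python
# (AS 58 time, KMNS time) in seconds, keyed by the case number in Table 1.
TABLE1_TIMES = {
    1: (63.86, 36.66),
    2: (43.49, 19.11),
    3: (135.71, 76.00),
    4: (95.51, 57.96),
    5: (0.17, 0.18),
}


def relative_saving(as58_time, kmns_time):
    """Fraction of the AS 58 time saved by KMNS (negative if KMNS is slower)."""
    return (as58_time - kmns_time) / as58_time


savings = {case: relative_saving(*times) for case, times in TABLE1_TIMES.items()}
for case in sorted(savings):
    print(case, round(100.0 * savings[case], 1))
```

On these figures the saving lies between roughly 39 and 56 per cent for the four cases with M = 1000, while case 5 is too small for the two timings to differ meaningfully.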
REFERENCES
BANFIELD, C. F. and BASSILL, L. C. (1977). Algorithm AS 113. A transfer algorithm for non-hierarchical classification. Appl. Statist., 26, 206-210.
HARTIGAN, J. A. (1975). Clustering Algorithms. New York: Wiley.
SPARKS, D. N. (1973). Algorithm AS 58. Euclidean cluster analysis. Appl. Statist., 22, 126-130.
      IFAULT = 3
      IF (K .LE. 1 .OR. K .GE. M) RETURN
C
C        FOR EACH POINT I, FIND ITS TWO CLOSEST CENTRES,
C        IC1(I) AND IC2(I).  ASSIGN IT TO IC1(I).
C
      DO 50 I = 1, M
      IC1(I) = 1
      IC2(I) = 2
      DO 10 IL = 1, 2
      DT(IL) = 0.0
      DO 10 J = 1, N
      DA = A(I, J) - C(IL, J)
      DT(IL) = DT(IL) + DA * DA
   10 CONTINUE
      IF (DT(1) .LT. DT(2)) GOTO 20
      IC1(I) = 2
      IC2(I) = 1
      TEMP = DT(1)
      DT(1) = DT(2)
      DT(2) = TEMP
   20 DO 50 L = 3, K
      DB = 0.0
      DO 30 J = 1, N
      DC = A(I, J) - C(L, J)
      DB = DB + DC * DC
      IF (DB .GE. DT(2)) GOTO 50
   30 CONTINUE
      IF (DB .LT. DT(1)) GOTO 40
      DT(2) = DB
      IC2(I) = L
      GOTO 50
   40 DT(2) = DT(1)
      IC2(I) = IC1(I)
      DT(1) = DB
      IC1(I) = L
   50 CONTINUE
C
C        UPDATE CLUSTER CENTRES TO BE THE AVERAGE
C        OF POINTS CONTAINED WITHIN THEM
C
      DO 70 L = 1, K
      NC(L) = 0
      DO 60 J = 1, N
   60 C(L, J) = 0.0
   70 CONTINUE
      DO 90 I = 1, M
      L = IC1(I)
      NC(L) = NC(L) + 1
      DO 80 J = 1, N
   80 C(L, J) = C(L, J) + A(I, J)
   90 CONTINUE
C
C        CHECK TO SEE IF THERE IS ANY EMPTY CLUSTER AT THIS STAGE
C
      IFAULT = 1
      DO 100 L = 1, K
      IF (NC(L) .EQ. 0) RETURN
  100 CONTINUE
      IFAULT = 0
      DO 120 L = 1, K
      AA = NC(L)
      DO 110 J = 1, N
  110 C(L, J) = C(L, J) / AA
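C
C        INITIALIZE AN1, AN2, ITRAN AND NCP
C        AN1(L) IS EQUAL TO NC(L) / (NC(L) - 1)
C        AN2(L) IS EQUAL TO NC(L) / (NC(L) + 1)
C
      AN2(L) = AA / (AA + 1.0)
      AN1(L) = BIG
      IF (AA .GT. 1.0) AN1(L) = AA / (AA - 1.0)
      ITRAN(L) = 1
      NCP(L) = -1
  120 CONTINUE
      INDEX = 0
      DO 140 IJ = 1, ITER
C
C        OPTIMAL-TRANSFER STAGE
C
      CALL OPTRA(A, M, N, C, K, IC1, IC2, NC, AN1, AN2, NCP, D,
     *  ITRAN, LIVE, INDEX)
C
C        STOP IF NO TRANSFER TOOK PLACE IN THE LAST M
C        OPTIMAL-TRANSFER STEPS
C
      IF (INDEX .EQ. M) GOTO 150
C
C        QUICK-TRANSFER STAGE
C
      CALL QTRAN(A, M, N, C, K, IC1, IC2, NC, AN1, AN2, NCP, D,
     *  ITRAN, INDEX)
C
C        IF THERE ARE ONLY TWO CLUSTERS, THERE IS NO NEED TO
C        RE-ENTER THE OPTIMAL-TRANSFER STAGE
C
      IF (K .EQ. 2) GOTO 150
C
C        NCP HAS TO BE SET TO 0 BEFORE ENTERING OPTRA
C
      DO 130 L = 1, K
  130 NCP(L) = 0
  140 CONTINUE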
C
C        SINCE THE SPECIFIED NUMBER OF ITERATIONS IS EXCEEDED
C        IFAULT IS SET TO BE EQUAL TO 2.
C        THIS MAY INDICATE UNFORESEEN LOOPING
C
      IFAULT = 2
C
C        COMPUTE WITHIN CLUSTER SUM OF SQUARES FOR EACH CLUSTER
C
  150 DO 160 L = 1, K
      WSS(L) = 0.0
      DO 160 J = 1, N
      C(L, J) = 0.0
  160 CONTINUE
      DO 170 I = 1, M
      II = IC1(I)
      DO 170 J = 1, N
      C(II, J) = C(II, J) + A(I, J)
  170 CONTINUE
      DO 190 J = 1, N
      DO 180 L = 1, K
  180 C(L, J) = C(L, J) / FLOAT(NC(L))
      DO 190 I = 1, M
      II = IC1(I)
      DA = A(I, J) - C(II, J)
      WSS(II) = WSS(II) + DA * DA
  190 CONTINUE
      RETURN
      END
      SUBROUTINE OPTRA(A, M, N, C, K, IC1, IC2, NC, AN1, AN2, NCP, D,
     *  ITRAN, LIVE, INDEX)
C
C        ALGORITHM AS 136.1  APPL. STATIST. (1979) VOL.28, NO.1
C
C        DEFINE BIG TO BE A VERY LARGE POSITIVE NUMBER
C
      DATA BIG /1.0E10/
C
      IF (NC(L1) .EQ. 1) GOTO 90
C
      DO 40 J = 1, N
      DB = A(I, J) - C(L2, J)
      DA = DA + DB * DB
   40 CONTINUE
      R2 = DA * AN2(L2)
      DO 60 L = 1, K
      IF (I .GE. LIVE(L1) .AND. I .GE. LIVE(L) .OR. L .EQ. L1 .OR.
     *  L .EQ. LL) GOTO 60
      RR = R2 / AN2(L)
      DC = 0.0
      DO 50 J = 1, N
      DD = A(I, J) - C(L, J)
      DC = DC + DD * DD
      IF (DC .GE. RR) GOTO 60
   50 CONTINUE
      R2 = DC * AN2(L)
      L2 = L
   60 CONTINUE
      IF (R2 .LT. D(I)) GOTO 70
C
C        IF NO TRANSFER IS NECESSARY, L2 IS THE NEW IC2(I)
C
      IC2(I) = L2
      GOTO 90
C
C        UPDATE CLUSTER CENTRES, LIVE, NCP, AN1 AND AN2
C        FOR CLUSTERS L1 AND L2, AND UPDATE IC1(I) AND IC2(I)
C
   70 INDEX = 0
      LIVE(L1) = M + I
      LIVE(L2) = M + I
      NCP(L1) = I
      NCP(L2) = I
      AL1 = NC(L1)
      ALW = AL1 - 1.0
      AL2 = NC(L2)
      ALT = AL2 + 1.0
      DO 80 J = 1, N
      C(L1, J) = (C(L1, J) * AL1 - A(I, J)) / ALW
      C(L2, J) = (C(L2, J) * AL2 + A(I, J)) / ALT
   80 CONTINUE
      NC(L1) = NC(L1) - 1
      NC(L2) = NC(L2) + 1
      AN2(L1) = ALW / AL1
      AN1(L1) = BIG
      IF (ALW .GT. 1.0) AN1(L1) = ALW / (ALW - 1.0)
      AN1(L2) = ALT / AL2
      AN2(L2) = ALT / (ALT + 1.0)
      IC1(I) = L2
      IC2(I) = L1
   90 CONTINUE
      IF (INDEX .EQ. M) RETURN
  100 CONTINUE
      DO 110 L = 1, K
      ITRAN(L) = 0
      LIVE(L) = LIVE(L) - M
  110 CONTINUE
      RETURN
      END
C
      SUBROUTINE QTRAN(A, M, N, C, K, IC1, IC2, NC, AN1,
     *  AN2, NCP, D, ITRAN, INDEX)
C
C        ALGORITHM AS 136.2  APPL. STATIST. (1979) VOL.28, NO.1
C
      DIMENSION A(M, N), IC1(M), IC2(M), D(M)
      DIMENSION C(K, N), NC(K), AN1(K), AN2(K), NCP(K), ITRAN(K)
C
C        DEFINE BIG TO BE A VERY LARGE POSITIVE NUMBER
C
      DATA BIG /1.0E10/
C
      ICOUN = 0
      ISTEP = 0
   10 DO 70 I = 1, M
      ICOUN = ICOUN + 1
      ISTEP = ISTEP + 1
      L1 = IC1(I)
      L2 = IC2(I)
C
C        IF POINT I IS THE ONLY MEMBER OF CLUSTER L1, NO TRANSFER
C
      IF (NC(L1) .EQ. 1) GOTO 60
C
      IF (ISTEP .GT. NCP(L1)) GOTO 30
      DA = 0.0
      DO 20 J = 1, N
      DB = A(I, J) - C(L1, J)
      DA = DA + DB * DB
   20 CONTINUE
      D(I) = DA * AN1(L1)
C
   30 IF (ISTEP .GE. NCP(L1) .AND. ISTEP .GE. NCP(L2)) GOTO 60
      R2 = D(I) / AN2(L2)
      DD = 0.0
      DO 40 J = 1, N
      DE = A(I, J) - C(L2, J)
      DD = DD + DE * DE
      IF (DD .GE. R2) GOTO 60
   40 CONTINUE
C
      ICOUN = 0
      INDEX = 0
      ITRAN(L1) = 1
      ITRAN(L2) = 1
      NCP(L1) = ISTEP + M
      NCP(L2) = ISTEP + M
      AL1 = NC(L1)
      ALW = AL1 - 1.0
      AL2 = NC(L2)
      ALT = AL2 + 1.0
      DO 50 J = 1, N
      C(L1, J) = (C(L1, J) * AL1 - A(I, J)) / ALW
      C(L2, J) = (C(L2, J) * AL2 + A(I, J)) / ALT
   50 CONTINUE
      NC(L1) = NC(L1) - 1
      NC(L2) = NC(L2) + 1
      AN2(L1) = ALW / AL1
      AN1(L1) = BIG
      IF (ALW .GT. 1.0) AN1(L1) = ALW / (ALW - 1.0)
      AN1(L2) = ALT / AL2
      AN2(L2) = ALT / (ALT + 1.0)
      IC1(I) = L2
      IC2(I) = L1
C
   60 IF (ICOUN .EQ. M) RETURN
   70 CONTINUE
      GOTO 10
      END
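The centre and ratio updates carried out in the listing when a point moves from cluster L1 to cluster L2 can be sketched in Python as follows. The function names are hypothetical, and the sketch covers only the arithmetic of a single transfer, not the bookkeeping of LIVE, NCP or ITRAN.

```python
BIG = 1.0e10  # stands in for the very large positive number BIG


def transfer_point(x, centre_from, n_from, centre_to, n_to):
    """Move point x out of a cluster of n_from points (n_from > 1) and
    into a cluster of n_to points, updating both centres incrementally."""
    new_from = [(c * n_from - xj) / (n_from - 1) for c, xj in zip(centre_from, x)]
    new_to = [(c * n_to + xj) / (n_to + 1) for c, xj in zip(centre_to, x)]
    return new_from, n_from - 1, new_to, n_to + 1


def ratios(n):
    """Return (AN1, AN2) = (n / (n - 1), n / (n + 1)) for a cluster of
    n points, with AN1 set to BIG for a single-point cluster."""
    an1 = n / (n - 1.0) if n > 1 else BIG
    an2 = n / (n + 1.0)
    return an1, an2


# Cluster {0, 2, 4} with centre 2 gives up the point 4 to cluster {10}.
new_from, n_from, new_to, n_to = transfer_point([4.0], [2.0], 3, [10.0], 1)
print(new_from, n_from, new_to, n_to)
print(ratios(n_from), ratios(n_to))
```

Updating the centres this way avoids recomputing each average from all member points after every transfer.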