A K-Means Clustering Algorithm

Algorithm AS 136: A K-Means Clustering Algorithm
Author(s): J. A. Hartigan and M. A. Wong

Source: Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 28, No. 1 (1979)
, pp. 100-108
Published by: Wiley for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2346830
Accessed: 19-12-2015 14:17 UTC
Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at http://www.jstor.org/page/
info/about/policies/terms.jsp
JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content
in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship.
For more information about JSTOR, please contact support@jstor.org.
Royal Statistical Society and Wiley are collaborating with JSTOR to digitize, preserve and extend access to Journal of the
Royal Statistical Society. Series C (Applied Statistics).
http://www.jstor.org
This content downloaded from 143.107.12.71 on Sat, 19 Dec 2015 14:17:36 UTC
All use subject to JSTOR Terms and Conditions
APPLIED STATISTICS
100
FIND WIMUM ENTRY
C
C
6o PIvcr = ,xC
KK = 0
DO 70 I = II, s
K = INDEX(I)
IF (ABS(LU(IC, IIl
PIVOT = ABS(W(K,
KK
C
c
C
C
C
.LE. PIVOT) GOTO70

IIM
70 CONTIlUE
IF (IE *EQ. 0) GOTO 10
SWITCHORDER
ISAVE = INDEX(ltK)
= INDEX('II
INDMEX(KiC)
INDEX(II) = ISAVE
PUT IN COUJimSor LU ONE AT A TIME
IR
IF (INTIA IIBASE(II)
IF (II *EQ. MNGOTO90
J = II + 1
ix) 80 I = it M
K = INDEXVI)
= W(E,
LU(E, II)
80 CONTINUE
II
/ LU(ISAVE,
II)
90 CCNTINUE
EKE = IRCW
RETURN
END
AlgorithmAS 136
A K-MeansClustering
Algorithm
By J. A.
HARTIGAN
and M. A.
WONG
New Haven,Connecticut,
Yale University,
U.S.A.
Keywords: K-MEANS CLUSTERING ALGORITHM; TRANSFER ALGORITHM
LANGUAGE
ISO Fortran
DESCRIPTION AND PURPOSE
The K-meansclustering
algorithmis describedin detailby Hartigan(1975). An efficient
versionof thealgorithmis presentedhere.
The aim of the K-meansalgorithmis to divideM pointsin N dimensionsintoK clusters
sum of squaresis minimized.It is not practicalto requirethatthe
so thatthe within-cluster
solutionhas minimalsum of squares againstall partitions,
exceptwhenM, N are smalland
K = 2. We seek instead"local" optima,solutionssuchthatno movementofa pointfromone
sum of squares.
clusterto anotherwill reducethe within-cluster
METHOD
The algorithmrequiresas inputa matrixof M pointsin N dimensionsand a matrixof

K initialclustercentresin N dimensions.The numberof pointsin clusterL is denotedby
NC(L). D(I, L) is theEuclideandistancebetweenpointI and clusterL. The generalprocedure
is to searchfora K-partitionwithlocallyoptimalwithin-cluster
sum of squares by moving
pointsfromone clusterto another.
STATISTICAL
ALGORITHMS
101
Step 1. For each pointI (I = 1,2, ..., M), finditsclosestand secondclosestclustercentres,

IC1(I) and IC2(I) respectively.AssignpointI to clusterICI(I).
Step 2. Update theclustercentresto be the averagesof pointscontainedwithinthem.
Step 3. Initially,all clustersbelongto thelive set.
Step 4. This is the optimal-transfer
(OPTRA) stage:
Considereach pointI (I = 1,2, ..., M) in turn. If clusterL (L = 1,2, ..., K) is updatedin the
last quick-transfer
(QTRAN) stage, then it belongs to the live set throughoutthis stage.
Otherwise,
at each step,itis notin thelivesetifithas notbeen updatedin thelastM optimaltransfersteps. Let point I be in clusterLI. If LI is in the live set, do Step 4a; otherwise,
do Step 4b.
Step4a. Computetheminimumofthequantity,R2 = [NC(L) * D(I, L)2]/[NC(L)+ 1],over
all clustersL (L LI, L= 1,2, ..., K). Let L2 be the clusterwiththe smallestR2. If this
value is greaterthanor equal to [NC(L1) * D(I, Ll)2]/[NC(Ll) - 1], no reallocationis necessary
andL2 is thenewIC2(I). (Note thatthevalue [NC(L1) * D(I,LI)2]/[NC(Ll) - 1] is remembered
and willremainthesameforpointI untilclusterLI is updated.) Otherwise,
pointI is allocated
to clusterL2 and LI is thenew IC2(I). Clustercentresare updatedto be themeansof points
assignedto themif reallocationhas takenplace. The two clustersthat are involvedin the
transfer
of pointI at thisparticularstepare now in theliveset.
Step 4b. This stepis the same as Step 4a, exceptthatthe minimumR2 is computedonly
over clustersin the live set.
Step 5. Stop if the live set is empty.Otherwise,go to Step 6 afterone pass throughthe
data set.
Step 6. This is thequick-transfer
(QTRAN) stage:
Considereach pointI (1 = 1,2, ..., M) in turn. Let LI = IC1(I) and L2 = IC2(I). It is not
necessaryto checkthe pointI ifboth theclustersLI and L2 have not changedin the last M
steps. Computethevalues
RI = [NC(L1) * D(I,L1)2]/[NC(L1)- 1] and R2 = [NC(L2) * D(I,L2)2]/[NC(L2)+ 11.
(As noted earlier,RI is remembered
and will remainthe same untilclusterLI is updated.)
If RI is less thanR2, pointI remainsin clusterL1. Otherwise,switchICl (I) and IC2(I) and
updatethecentresof clustersLI and L2. The twoclustersare also notedfortheirinvolvement
in a transfer
at thisstep.
tookplace in thelast M steps,go to Step4. Otherwise,
Step 7. If no transfer
go to Step 6.
STRUCTURE
SUBROUTINE KMNS (A, M, N, C, K, ICI, IC2, NC, AN1, AN2, NCP, D, ITRAN, LIVE,
ITER, WSS, IFAULT)
Formalparameters
A
Real array(M, N)
M
Integer
N
Integer
Real array(K, N)
C
K
IC1
IC2
Integer
Integerarray(M)
Integerarray(M)
NC
AN1
AN2
Integerarray(K)
Real array(K)
Real array(K)
the data matrix

the numberof points
the numberof dimensions
thematrixof initialclustercentres
the matrixof finalclustercentres
thenumberof clusters
the clustereach pointbelongsto
thisarrayis used to remember
theclusterwhich
each pointis mostlikelyto be transferred
to at
each step
output: the numberof pointsin each cluster
workspace:
workspace:
input:
input:
input:
input:
output:
input:
output:
workspace:
102
NCP
D
ITRAN
LIVE
ITE R
WSS
IEA ULT
APPLIED
Integerarray(K)
Real array(M)
Integerarray(K)
Integerarray(K)
Integer
Real array(K)
Integer
STATISTICS
workspace:
workspace:
workspace:
workspace:
input: themaximumnumberof iterationsallowed
output: thewithin-cluster
sumofsquaresofeach cluster
output: see Fault Diagnosticsbelow
FAULT DIAGNOSTICS
IFA ULT -0 No fault

IFA ULT 1 At leastone clusteris emptyaftertheinitialassignment.(A bettersetofinitial
clustercentresis called for)
IFA ULT = 2 The allowedmaximumnumberof iterationsis exceeded
IFA ULT = 3 K is less thanor equal to 1 or greaterthanor equal to M
Auxiliaryalgorithms
The followingauxiliaryalgorithmsare called: SUBROUTINE OPTRA (A, M, N, C, K,
IC1, IC2, NC, AN1, AN2, NCP, D, ITRAN, LIVE, INDEX) and SUBROUTINE QTRAN
(A, M, N, C, K, IC1, IC2, NC, AN1, AN2, NCP, D, ITRAN, INDEX) whichare included.
RELATED ALGORITHMS
is AS 113(A transfer
fornon-hierarchial
A relatedalgorithm
classification)
given
algorithm
uses swopsas wellas transfers
to tryto overcome
byBanfieldand Bassill(1977). Thisalgorithm
theproblemof local optima;thatis, forall pairsof points,a testis made whetherexchanging
more
theclustersto whichthepointsbelongwillimprovethecriterion.It willbe substantially
expensivethanthepresentalgorithmforlargeM.
AS 58 (Euclideanclusteranalysis)givenby
The presentalgorithmis similarto Algorithm
aim at finding
a K-partition
of thesample,withwithin-cluster
Sparks(1973). Bothalgorithms
sum of squares whichcannot be reducedby movingpointsfromone clusterto the other.
of AlgorithmAS 58 does not satisfythis condition. At the
However,the implementation
stage whereeach point is examinedin turnto see if it should be reassignedto a different
cluster,onlythe closestcentreis used to checkforpossiblereallocationof the givenpoint;
a clustercentreother than the closest one may have the smallestvalue of the quantity
+ 1)} dI2,wheren,is thenumberof pointsin cluster/and di is thedistancefromclusterI
{nl/(n1
to the givenpoint. Hence, in general,AlgorithmAS 58 does not providea locally optimal
solution.
are testedon variousgenerateddata sets. The timeconsumedon the
The two algorithms
sum of squares of the resultingK-partitions
IBM 370/158and thewithin-cluster
are givenin
Table 1. While comparingthe entriesof the table, note that AS 58 does not give locally
forthe
optimalsolutionsand so shouldbe expectedto take less time. The WSS are different
two algorithmsbecause theyarriveat different
partitionsof the sets of points. A savingof
about 50 per centin timeoccursin KMNS due to using"live" setsand due to usinga quickiterationsby a factorof 4. Thus,
transfer
stagewhichreducesthenumberof optimaltransfer
KMNS comparedto AS 58 is locallyoptimaland takesless time,especiallywhenthenumber
of clustersis large.
TIME AND AccupAcY
The timeis approximately
equal to CMNKI whereI is the numberof iterations.For an
data structures
IBM 370/158,C = 21 x 10-5sec. However,different
requirequite different
numbersof iterations;and a carefulselectionof initialclustercentreswill also lead to a
considerablesavingin time.
Storagerequirement:
M(N+ 3) + K(N+ 7).
STATISTICAL
ALGORITHMS
103
TABLE 1
Time(sec)
WSS
1. M = 1000, N = 10, K = 10
AS 58
63-86
7056-71
2. M = 1000,N = 10, K = 10
AS 58
KMNS
4349
19 11
7779*70
7822-01
3. M = 1000, N = 10, K = 50
AS 58
135-71
4543-82
4. M = 1000,N = 10,K = 50
AS 58
KMNS
95-51
5131P04
5. M = 50, N = 2, K = 8
AS 58
0-17
21-03
(random
spherical
normal)
KMNS
(twowidelyseparatedrandomnormals)
(random
spherical
normal)
KMNS
(twowidely
separated
random
normals)
random
(twowidely
separated
normals)
KMNS
36-66
76-00
57-96
7065-59
456148
5096-23
0-18
21V03
Missingvariatevalues cannotbe handledby thisalgorithm.

The algorithmproducesa clustering
whichis onlylocallyoptimal;thewithin-cluster
sum
of squares may not be decreasedby transferring
a point fromone clusterto another,but
different
partitionsmayhave the same or smallerwithinclustersum of squares.
The numberof iterationsrequiredto attainlocal optimalityis usuallyless than 10.
ADDITIONALCOMMENTS
One way of obtainingthe initial clustercentresis suggestedhere. The points are

firstordered by their distancesto the overall mean of the sample. Then, for cluster
L (L = 1,2, ..., K), the {1 + (L -1) * [M/K]}thpointis chosen to be its initialclustercentre.
In effect,
someK samplepointsarechosenas theinitialclustercentres.Usingthisinitialization
process,it is guaranteedthat no clusterwill be emptyafterthe initialassignmentin the
whichis dependenton theinputorderof thepoints,takes
subroutine.A quick initialization,
the firstK pointsas the initialcentres.
ACKNOWLEDGEMENTS
This researchis supportedby National ScienceFoundationGrantMCS75-08374.

REFERENCES
AS113. A transfer
C. F. and BASSILL,L. C. (1977). Algorithm
algorithm
fornon-hierarchical
classification.
Appl.Statist.,26, 206-210.
New York: Wiley.
HARTIGAN,J.A. (1975). Clustering
Algorithms.
AS 58. Euclideanclusteranalysis.Appl.Statist.,22, 126-130.
SPARKS,D. N. (1973). Algorithm
BANFIELD,
*
C
SUBROUTINEKIRTS(A, M, N, C, K, ICI, ICZ, NC, ANI, AN2, NCP,

D, ITRAN, LIVE, ITER, WSS, IFAULT)
ALGORITHMAS 136 APPL. STATIST. (1979) VOL.28, NO.1
C
C
DIVIDE M POINTS IN N-DIMENSIONALSPACE INTO K CWSTERS

SO THIATTHE WITHIN CUJSTERSUM OF SQUARESIS MINIMIZED.
C
C
C
DIMENSION A(M, N, ICI(Ml, IC2(M), D(Ml

DIMENSIONC(K, N), NC(K), AN1(K), AN2(K), NCP(K)
DIMENSION ITRAN(K), LIVE(K), WSStK), DT(2)
DEFINE BIG TO BE A VERY LARGEPOSITIVE NUMBER
DATABIG /1,0E1O/
104
APPLIED
IFAULT = 3
C
C
C
C
IF
(I
LE.
OR, K
FOR EACH POITr I,

AND IC2(I).
IC1(I)
STATISTICS
GE, M) RETURN
FIND ITS TWO CLOSEST
ASSIGN IT TO IC1(I).
CENTRES,
DO 50 I = 1, M
= 1
ICl(I)
= 2
IC2(I)
DO 10 IL-- 1, 2
= 0.0
DT(IL)
DO 10 J = 1, N;
DA = A(I,
J) - C(IL,
J)
= DT(IL)
DT(IL)
+ DA * DA
10 COtNTINUE
IF (I)T(1)
.
T(2!)) GOTO 2o
.
= 2
ICI(I)
=
1
IC2(I)
TEMP = DT(1
DT(1)
DT(2)
= TEMP
DT()
20 DO 50 L = 3, K
DB = 0.0
DO 30 J = 1, NJ
C
C
C
C
C
C
c
C
C
c
C
C
C
C
DC = A(I,
J) - C(L, J)
DB = DB + DC * DC
IF (DO .GE. DT(2))
GOTO 50
30 CONTINUE
IF (DB *LT. DT(1)*
GCTO 40
= D
DT(2)
IC2(I
= L
GOTO 50
= DT(1l
40 DT(2)
= ICI(I)
IC2(I)
=
DO
DT(l)
= L
IC1(I)
50 C0NTIN4UE
UPDATE CUISTER CENTRES TO BE TIIE AVERAGE
(OF POINTS CONTAINED WITHIN THEM
DO 70 L
1, K
NC(L) = 0
= 1, tN
DO 6o
6o C(L, J) = 0.0
70 CONTINUE
DO 90 I = 1, M
L = IC(I)
TIC(L) = NC(L) + i
D)O So 3 = 1, 1N
80 C(L, J) = C(L, J) + A(I,
90 CONTINTUE
3)
CHIECK TO SEE IF THIERE IS
ANY EMPTY CLUSTER AT THIS
STAGE
IFAULT = I
K
DO 100 L =1
IF (NC(L
*EQ. 01 RETURN
100 CONTINUE
IFAULT = 0
DO 120 L = 1, IC
AA = NC(L)
1, N
DO 110 J
110 C(L, J) - C(L, J) / AA
INITIALIZE
ANI(L)
IS
AN2(L) IS
ANI, AN2, ITRiAN AND NCP

- 1)
EQUAL TO NCML) / (NC(L)
EQUAL TO NC'L) / (NC(Ll + 1)
IF CLUSTER L IS UPDATED IN THE QUICiK-TRtANSFER STAGE
ITRAN(L)=l
ITRAlTN'L)=0 OTHJERWISE
IN THE1DOPTIMALITRANSFER STAGE, NCP(L) INDICATES THE STEP AT
WIiICHI CLUSTER L IS LAST UPDAITED
STATISTICAL
C
C
C
C
C
C
C
C
C
= AA / CM + 1.0)
AN2(L
AN1(L) = BIG
AN1(L) =
IF (AA GT. 1.0)
ITRAN(L)
I1
NCP(LL = -1
120 CONTINUE
INDEX = 0
DO 140 IJ _ 1, ITER
C
C
C
C
C
C
C
C
/ (AA - 1.0)
CALL OPTRA(A, M, N, C, K, ICI,

D, ITRAN, LIVE, INDEX)
IF (INDEX
IC2,
NC, ANI,
ANZ, NCP,
) GOtO 150
EAXCHPOINT IS TESTED IN TURN TO SEE IF IT SHOULD BE

REALDOCATED TO THE CLUSTER WIIICH IT IS MOST LIKELY TO
BE TRANSFERRED TO (IC2(I))
FROM ITS PRESENT CLUSTER (ICI(I)).
LOOP THIROUGHITHE DATA UNTIL NO FURTHER CHANGE IS TO TAKE PLACE
CALL QTRAN(A, Nl, N, C, K, ICi,

NCP, D, ITRAN, INDEX)
C
C
C
STOP IF NO] TRANSFER TOOK PLACE IN THE LAST M

OPTIMAL-TRANSFER STEPS
C
C
C
C
C
C
C
C
105
IN TIHIS STA\GE, THERE IS ONLY ONE PASS THROUGH THI DATA.

EACIH POINT IS REALLOCATED, IF NECESSARY, TO TIIE CLUSTER
TIHAT WILLi INDUCE THE MAXIMUMREDUCTION IN WITIIIN-CLUSTER
SUM OF SQUARES
C
C
C
C
ALGORITHMS
IN THE QUICK-TRANSFER STAGE, NCP(L) IS EQUAL TO THE STEP AT

WHICH CLUSTER L IS LAST UPDATED PLUS M
IC2,
NC, ANI,
AN2,
IF THERE ARE ONLY TWO CLUSTERS,

NO NEED TO RE-ENTER OPTIMAL-TRANSFER STAGE
(KC EQ. 2)
IF
GOT 150
NCP HIAS TO BE SET TO 0 BEFORE ENTERING OPTRA
DO 130 L 1, K
130 NCP W = 0
140 CONTIINUE
SINCE TlE SPECIFIED
NUMBER OF ITERATIONS
IFAULT IS SET TO BE EQUAL TO 2.
MAY INDICATE UNFORESEEN LOPING
TIIS
IS EXCEEDED
IFAULT = 2
CUMPUTE WITHIIN CLUSTER SUM OF SQUARES FOR EACH CLUSTER
150 Do 160 L
K
1-,
WSS(L) S 0.0
jX i6o J = 1, N
0.0
C(L, J)
16o CoNTINUE
DO 170 I
1,
It
II = ICi(I)
DO 170 J = 1, N
C(II,
J) = C(II,
170 CONTINUE
J) + ACI,
J)
DO 10 J
1, N
1, K
DOt 130 L
180 CML, J) = C(L, J) / FLOAT(NC(L))
DO 190 I = 1, M
II = ICI(I)
DA = A(I,
J) - CCII# J)
= WSS(II)
WSS(II)
+ DA * DA
190 COTlINUE
RETURN
END
106
APPLIED STATISTICS
C
C
C
C
C
C
C
C
C
SUBROUTINE OPTRA(A, M, N, C, K, ICI,

* AN2, NCP, D, ITRAN, LIVE, INDEX)
AIM(JRITHM AS 136.1
THIS
DEFINE
BIG TO BE A VERY IARGE POSITIVE
NCP(K)
NUMBER
IF CLUSTER L IS UPDATED IN THE LAST QUICK-TRANSFER STAGE,

IT BELONGS TO THE LIVE SET TIHROGHOUT THIS STAGE.
OTIHE-RVISE, AT EACH STEP, IT IS NOT IN THE LIVE SET IF IT
HItS NUT BEEN UPDATED IN THE LAST M OPTIMAL-TRANSFER STEPS
DO 10 L = 1, K
IF (ITRAN(L)
EQ. 1)
10 CONTINIJE
DO 100 I =1,
m
INDEX = INDEX, + 1
Li = 1C1(I)
I2 = IC2(I)
LL =
IF POINT
IF
(NC(LI)
LIVE(L)
- M + 1-
I IS THE ONLY MEMBER OF CWSTER IA, NO TRANSFER

*EQ.
IN GTO
90
IF L IHAS NUT YET BEEN UPDATED IN THIS

NO NE-ED TO RECOMPUTE D(I)
STAGE
IF (NCP(LI) .EQ. 0) GOTO30

DE =. 0.0
DO 20 J = 1, N
J
DF = A(I,
J - C(I=,
DE = DE + DF * DF
20 CONfTINUE
D(I) = DE * AN1(L1,
FIND TIM CLUSTERWITH MINIMUMf2
30 BJA= 0.0
DO 40 J
1, N
DB = A(I,
J. - C(IR,
DA = DA + M * D
NO.1
DATA BIG /1.OE1O/
C
C
VOL.28,
IS THE OPTIMAL-TRANSFER STAGE
(1979)
DIMENSION A(M, N), IC1(M),

IC2WMv, D(M)
DIMENSION C(K, N), NC(KW, AN1(K), AN2(K),
DIMENSION ITRAN(K),
LIVE(K)
C
C
C
C
C
C
STATIST,
NC, ANI,
EACII POINT IS REALWCATED, IF NECESSARY, TO THE

CLJSTER THlAT WILL INDUCE A MAXIMUMREDUCTION IN
TH}I1 WITHIN-CLUSTER SUM OF SQUlARES
C
C
C
C
C
C
C
C
C
APPL.
IC2,
J)
40 CONTINUE
R2 = DA * AN2()
Do 60 L
Is K
C
C
C
C
C
C
IF I IS GsREATERTHAN OR EQUAL TO LIVE.Li),

THEN tl IS
NUT IN TIHE LIVE SET.
IF THIS IS TRUE, WE ONLY NEED TO
CONSIDER CUJSTERS TIAT ARE IN TIE LIVE SET FOR POSSIBLE
TRANSFER OF POINT I.
UTHERWISE, WE NEED TO CONSIDER
ALL POSSIBLE CLUSTERS
IF
(I
.GE,
LIV,E(L)
AND.
I .GE.
LIVE(L)
i OR. L ,EQ. LL) GUTO60

L .EQ.
RR = NT / AN2(L)
OR.
DC = 0.0
DO 50 J = 1, NJ
DD = A(I,
J) - C'L,
DC = DC + DD * DD
n
IF (DC .GE. RR) GA
3)
6o
107
STATISTICAL ALGORITHMS
50 CONTINUE3
R2 = DC * AN2(L)
L2 = L
6o CoNTINUE3
IF
C
C
C
C
C
C
C
(R2
.LT.
DIV)
G(FO 70
IF NO TRAiNSFER IS
NECESSARY,
12 IS THE NE?t IC2(I)
= J2
IC2(I)
GTrO go
UPDATE CLJSTER CENTRES,
FOR CLUSTERS L AND 12,
LIVE, NCP, AN1 AND AN2

AND IC2(I)
AND) UPDAT13 IC1(I)
70 INDIX = 0
= M+ I
LIVE(L1)
= M + I
LIVEL'T)
= I
NCP(L1)
= I
NCP(LE)
ALl = NC(1)
ALY = ALA - 1. 0
ALE2 = C S(2)
ALT = ALTE?+ 1,0
1, N
DO 8o0J
C(L1,
C(L2,
J) = (C(L,r
J) - (C(2,
80 CONTINUE)
J) * ALI - A(I,
J) * AL9E + A(I,
1
NC(L1) = NCC(L1I
= NC(1 2) + 1
NC(L2)
= ALW / ALL
AN2 (LI)
AN1(LIL = BIG
= A1T
AN(L1
IF (ALW GT. 1.0)
= ALT / AU
AN1(L2)
= ALT / (ALT + 1,0)
AN2(L2)
= 2
IC1(I)
= LI
IC2(I)
90 COnUE=
IF (INDEX .EQ, MN RETURN
100 CONITINUE
1, 1
DOI 110 L
C
C
C
C
C
0
ITRAN (L)
- LIVE(L)
LIVE(L!
110 C(ONTINUE
RETURIN
END
C
C
C
C
C
C
C
C
C
C
C
C
C
(ALW
ALW
ALT
1.0)
ITRAN(Ll IS SET TO ZERO BEFORE ENTERING, QTRAXL

liAS TO BE DECREASED BY M BEFORE
ALSO, LIVE(L)
RE-ENTERING OPTRAt
J))
J))
- M
SUBROlTINE QTRAN(A, M, N, C, K,
ANE, NCP, D, ITRAN, INDEX)
ALORITIJULAS 136.?.
IC1,
IC2,
NC, AN1,
APPL. STATIST. (17q)
VOL.28, NO.1
TIIIS IS TIIE QUICK TRANiSFER STAGE.

IS TIHE CLJSTER WHIICII POINT I BELONGS TO.
IC1(IN
IS TE CLUJSTER WHIICHIP0INT I IS MOST
IC2(I)
LIKEL.Y TO BE TR.ANSFERRED TO.
AND IC2(I*
ARE SWITCHED, IFs
FOR EACH POINT I, IC1(I)
NECESSARY, TO REDUCE WITHIN CLUSTER SUM OF SQUARES.
THE CLJUSTER CENTRES ARE UPDATED AFTE1R EACH STEP
DIMENSION
DIMENSIOi
DEFINE
DATA BIG
IC2(M),
A(V, N), IC1(M),
C(KC, N1), NC(KC), AN1(1),
D(M)
AN2(K),
BIG TO BE A VERY LRGlE POSITIVE
NCP(K),
ITRAN(K)
NUMIIER
/1,OE1O/
APPLIED STATISTICS
108
IN THE )OPTIMAL-TRANSFERSTAGE, NCP(L) INDICATES THE

STEP AT WHICHICUJSTER L IS LAST UPDATED
IN TIHE QUICIC-TRANSFER STAGE, NCP(LN IS EQUAL TO THE
STEP AT WHICII CLUSTER L IS LAST UPDATED PLUS M
C
C
C
C
C
ICOUN
ISTEP
=
=
0
0
10 DO 70 I = 1, M
ICWUN ICOlUN+ 1
ISTEP + 1
ISTEP
LI = IC1(I)
L2 = ICZ(I)
IF POINT I IS THE ONLY MEMBEROF CUJSTER Li,
C
C
*EQ. 1) GOTO 6
IF (NC(.L1)
C
C
C
C
C
C
NO TRANSFER
IF IST1EP IS GREATER THIANNCP(LA), NO NEED TO RECOMPUTE

DISTANCE FROM4POINT I TO CLUSTER LI
NOTE THAT IF CLUSTER LI IS LAST UPDATED EXACTLY MASTEPS
AGO WE STILL NEED TO COMPUTE THE, DISTANCE FROM POINT I
TO CLUSTER LI
IF (ISTEP
0.0
DO 20 J =
DB = A(I,
DA = DA +
20 CONTIlUE
D(I) = DA
DA =
C
C
C
.GT. NCP(L1))
1, NJ
J) - C(LX,
DO * DO
*
GOTO 30
J)
ANl(Li)
IF ISTEP IS GREATER THAN OR EQUAL TO BOTl NCP(Li) AND

NCP(12) THERE WILL BE NO TRANSFER OF POINT I AT THIS STEP
30 IF (ISTEP
R2 = D(I)
.AND. ISTEP
.GE. NCP(i)
/ AN2(L2)
GE. NCP(L2))
GOTo
6o
DD = 0.0
DO 40 J = 1, N
DE = A(I, J) - C(L2,
DD =
DO + DE
* DE
IF (DD
GE. R12) GOTr
40 COtNTINUE
C
J)
Go
UPD}ATECLJUSTERCENTRES, NCP, NC, ITRAN, ANI AND AN2

AND IC2(I).
FOR CLlJSTERS LA AND 12.
ALSO, UPDATE IC1(I)
NOTE THIAT IF Al4Y UPDATING OCCURS IN THIS STAGE,
INDEX IS SET BACK TO 0
C
C
C
C
C
0
0
INDEX
ITRAN(Li) = 1
ICOUN
= 1
ITRAWT(2)
NCP(L1) = ISTEP + M
NCP(L2) = ISTEP + M
ALi = N4C
(LI)
ALW = ALL - 1.0
AI2 = NC(U.)
ALT = AL2 + 1. 0
DO 507 = 1, N
J) * ALI - A'I, J.) / ALU
C(LI, J) = (C(Li,
(C(12, J) * ALS + A(I, J)) / ALT
C(L2, J)
50 CONTINIUE
- 1
=NC(Li
NC(LI)
NC(L2) = NC(12) + 1
=
ALU( / ALi
AN2(L1)
AN1(LU) = BIG
= ALW / (ALW - 1.0)
IF (ATJ .GT. 1.0) AN1()
AN1(L-2) = ALT / ALI
AN2(12) = ALT / t(ALT + 1.0)
-= 2
ICiCI)
IC2(I)
C
I,
IF NO REALUICATION TOOK PLACE IN TliE LAST M STEPS,
C
C
6o
IF (ICOUN .EQ.
70 CONTINUE
GOYTOl
10
IND
RERN
M) RETURN

A K-Means Clustering Algorithm

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A K-Means Clustering Algorithm

Uploaded by

Copyright:

Available Formats

Algorithm AS 136: A K-Means Clustering Algorithm

Author(s): J. A. Hartigan and M. A. Wong

FIND WIMUM ENTRY

.LE. PIVOT) GOTO70

The algorithmrequiresas inputa matrixof M pointsin N dimensionsand a matrixof

Step 1. For each pointI (I = 1,2, ..., M), finditsclosestand secondclosestclustercentres,

the data matrix

IFA ULT -0 No fault

Missingvariatevalues cannotbe handledby thisalgorithm.

One way of obtainingthe initial clustercentresis suggestedhere. The points are

This researchis supportedby National ScienceFoundationGrantMCS75-08374.

SUBROUTINEKIRTS(A, M, N, C, K, ICI, ICZ, NC, ANI, AN2, NCP,

ALGORITHMAS 136 APPL. STATIST. (1979) VOL.28, NO.1

DIVIDE M POINTS IN N-DIMENSIONALSPACE INTO K CWSTERS

DIMENSION A(M, N, ICI(Ml, IC2(M), D(Ml

FOR EACH POITr I,

CHIECK TO SEE IF THIERE IS

ANY EMPTY CLUSTER AT THIS

ANI, AN2, ITRiAN AND NCP

CALL OPTRA(A, M, N, C, K, ICI,

EAXCHPOINT IS TESTED IN TURN TO SEE IF IT SHOULD BE

CALL QTRAN(A, Nl, N, C, K, ICi,

STOP IF NO] TRANSFER TOOK PLACE IN THE LAST M

IN TIHIS STA\GE, THERE IS ONLY ONE PASS THROUGH THI DATA.

IN THE QUICK-TRANSFER STAGE, NCP(L) IS EQUAL TO THE STEP AT

IF THERE ARE ONLY TWO CLUSTERS,

NCP HIAS TO BE SET TO 0 BEFORE ENTERING OPTRA

SUBROUTINE OPTRA(A, M, N, C, K, ICI,

BIG TO BE A VERY IARGE POSITIVE

IF CLUSTER L IS UPDATED IN THE LAST QUICK-TRANSFER STAGE,

I IS THE ONLY MEMBER OF CWSTER IA, NO TRANSFER

IF L IHAS NUT YET BEEN UPDATED IN THIS

IF (NCP(LI) .EQ. 0) GOTO30

DATA BIG /1.OE1O/

IS THE OPTIMAL-TRANSFER STAGE

DIMENSION A(M, N), IC1(M),

EACII POINT IS REALWCATED, IF NECESSARY, TO THE

IF I IS GsREATERTHAN OR EQUAL TO LIVE.Li),

i OR. L ,EQ. LL) GUTO60

12 IS THE NE?t IC2(I)

LIVE, NCP, AN1 AND AN2

ITRAN(Ll IS SET TO ZERO BEFORE ENTERING, QTRAXL

APPL. STATIST. (17q)

TIIIS IS TIIE QUICK TRANiSFER STAGE.

BIG TO BE A VERY LRGlE POSITIVE

IN THE )OPTIMAL-TRANSFERSTAGE, NCP(L) INDICATES THE

IF POINT I IS THE ONLY MEMBEROF CUJSTER Li,

IF IST1EP IS GREATER THIANNCP(LA), NO NEED TO RECOMPUTE

IF ISTEP IS GREATER THAN OR EQUAL TO BOTl NCP(Li) AND

UPD}ATECLJUSTERCENTRES, NCP, NC, ITRAN, ANI AND AN2

IF NO REALUICATION TOOK PLACE IN TliE LAST M STEPS,

You might also like