
Hard C Mean and Fuzzy

C Mean Clustering
Professor Ashok Deshpande, PhD (Engineering)
Founding Chair: Berkeley Initiative in Soft Computing
(BISC) - Special Interest Group (SIG) - Environment
Management Systems (EMS)
Guest Faculty: University of California, Berkeley, CA, USA
Visiting Professor: University of New South Wales, Canberra,
Australia, and Indian Institute of Technology Mumbai
Adjunct Professor: NIT Silchar and College of Engineering Pune
Former Deputy Director: National Environmental Engineering
Research Institute (NEERI), India
So far as the laws of mathematics
refer to reality they are not
certain. And so far as they are
certain, they do not refer to
reality.
Albert Einstein
Theoretical Physicist and Nobel Laureate,
Geometrie und Erfahrung, Lecture to the Prussian Academy, 1921
Classification and Clustering

From causes which appear similar, we
expect similar effects. This is the sum
total of all our experimental conclusions.

David Hume
Scottish philosopher
An Enquiry Concerning Human Understanding, 1748
A Commentary on Clustering
and Classification

There is a structure in nature. Much of this structure
is known to us and is quite beautiful.

Natural sphericity of rain drops and bubbles: why do
balloons take this shape?

Some patterns, such as the helical appearance of DNA or
the cylindrical shape of some microbes.

Consider the geometry and colourful patterns of a
butterfly's wings: why do these patterns exist in our
physical world?
A Commentary on Clustering
and Classification

Answers to some of these questions are still
unknown; many others have been discovered through
increased understanding in physics, chemistry and
biology.

Just as there is structure in nature, we believe
there is an underlying structure in most of the
phenomena we wish to understand.

Examples:
Image recognition
Molecular biology applications such as protein folding
and 3D molecular structure
Oil exploration
Cancer detection, and so on.
A Commentary on Clustering
and Classification

For fields dealing with diagnosis, we often seek to
find structure in the data obtained from
observations (visual, audio, perception based, etc.).
Finding the structure in the data is the essence
of classification.
By finding structures we are classifying the data
according to similar patterns, attributes, features
and other characteristics.
In classification, also termed clustering, the
most important issue is deciding the criteria
against which to classify.
What is Cluster Analysis?
Cluster analysis is the name of a group of multivariate
techniques whose primary purpose is to identify similar
entities from the characteristics they possess.

Cluster analysis is also referred to as:

Q-analysis
Typology
Classification analysis
Numerical taxonomy
Cluster Analysis: Objective

Group data points into clusters so that the degree of
association is strong between members of the same
cluster (low variance within) and weak between
members of different clusters (high variance
between).
Where Do We Use Cluster Analysis ?

Sequence Alignment
Phylogenetic Tree Building
Ecological Data Analysis
Grouping patients into disease areas
Physiology and Anatomy
Taxonomy
Chemistry and so on.
Basic Methods
Clustering Formalism
-Three Types
1. Based on classical methods:

Hierarchical (HR) and non-hierarchical (NR) methods such as:
nearest/farthest neighbourhood (single linkage or
complete linkage), Hard c-Means (or k-means
clustering)

2. Fuzzy logic based methods:

Fuzzy equivalence relations
Fuzzy c-Means
Clustering Formalism
-Three Types

3. Based on bioinformatics concepts:

UPGMA, Transformed Distance (TD), Neighbour
Relations (NR), Neighbour Joining (NJ),
Maximum Parsimony (MP), Maximum Likelihood
Estimation (MLE), Hidden Markov Models (HMM),
Markov Chain Monte Carlo (MCMC) and many more
Similarity Measures

Similarity measures or relations, and distance
measures, are two important facets of any
classification or cluster analysis. Some of the other
aspects include feature selection, the number of classes/
patterns, and the like.

We will discuss one of the methods of similarity
relations before HCM and FCM.
Partitioning

How should inter-object similarity be measured?

Similarity measures:

- Closeness or proximity between each pair of objects.
- Distance or the difference between pairs of objects.

Since distance is the complement of similarity, this
approach can be used to assess similarity.
Similarity Measures :Variables
of Mixed Type

Gower's similarity measure, or Gower's coefficient:

s_ij = ( Σ_{k=1..p} w_ijk s_ijk ) / ( Σ_{k=1..p} w_ijk ),

where s_ijk is the similarity between the ith and jth
individuals as measured by the kth variable, and w_ijk
is typically 1 or 0 depending on whether or not the
comparison is considered valid for the kth variable.
A weight of zero is assigned when variable k is
unknown for one or both individuals, or to binary
variables where it is required to exclude negative
matches.
Similarity Measures
For variables of Mixed Type

For categorical variables the component similarities
s_ijk take the value one when the two individuals
have the same value and zero otherwise. For
quantitative variables they are given by

s_ijk = 1 - |x_ik - x_jk| / R_k,

where x_ik and x_jk are the two individuals' values for
variable k, and R_k is the range of variable k, usually in
the set of individuals to be considered.
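To make the mechanics concrete, here is a minimal Python sketch of Gower's coefficient under the conventions just described. The function name, the encoding of the variables and the treatment of the ordinal anxiety level as a plain categorical match are illustrative assumptions, not part of the original slides.

import numpy as np

def gower_similarity(xi, xj, kinds, ranges, exclude_negative=()):
    """Gower's coefficient for two individuals described by mixed variables.
    xi, xj : feature values; kinds : 'quant' or 'cat' per variable;
    ranges : range R_k for quantitative variables (None otherwise);
    exclude_negative : indices of binary variables whose negative
    (absent/absent) matches get weight zero."""
    num, den = 0.0, 0.0
    for k, (a, b) in enumerate(zip(xi, xj)):
        if k in exclude_negative and a == b == "No":   # negative match -> w_ijk = 0
            continue
        if kinds[k] == "quant":
            s = 1.0 - abs(a - b) / ranges[k]
        else:                                          # categorical: 1 if equal, else 0
            s = 1.0 if a == b else 0.0
        num += s                                       # w_ijk = 1 for every valid comparison
        den += 1.0
    return num / den if den else 0.0

# Hypothetical encoding of the five psychiatric patients listed below
# (weight, anxiety, depression, hallucination, age group); weight range R = 40 lbs.
patients = [
    (120, "Mild",     "No",  "No",  "Young"),
    (150, "Moderate", "Yes", "No",  "Middle"),
    (110, "Severe",   "Yes", "Yes", "Old"),
    (145, "Mild",     "No",  "Yes", "Old"),
    (120, "Mild",     "No",  "Yes", "Young"),
]
kinds  = ["quant", "cat", "cat", "cat", "cat"]
ranges = [40, None, None, None, None]

s12 = gower_similarity(patients[0], patients[1], kinds, ranges, exclude_negative=(2, 3))
print(round(s12, 4))   # 0.0625, matching the hand calculation for patients 1 and 2 below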
Data: Five Psychiatrically Ill Patients

           Weight (lbs)  Anxiety level  Depression present?  Hallucination present?  Age group
Patient 1  120           Mild           No                   No                      Young
Patient 2  150           Moderate       Yes                  No                      Middle
Patient 3  110           Severe         Yes                  Yes                     Old
Patient 4  145           Mild           No                   Yes                     Old
Patient 5  120           Mild           No                   Yes                     Young
The investigator wishes to exclude negative matches on
depression and hallucinations from the calculation of
between patient similarity.
Gower's similarity measure for each pair of patients:
S12 = {1 x (1 - 30/40) + 1 x 0 + 1 x 0 + 0 x 1 + 1 x 0} / [1 + 1 + 1 + 0 + 1] = 0.0625
S23 = {1 x (1 - 40/40) + 1 x 0 + 0 x 1 + 1 x 0 + 1 x 0} / [1 + 1 + 0 + 1 + 1] = 0

S13 = 1 x {1 - |120-110|/40} + 1 x 0 + 1 x 0 + 1 x 0 + 1 x 0 = 0.150
S14 = 1 x {1 - |120-145|/40} + 0 x 1 + 0 x 1 + 1 x 0 + 1 x 0 = 0.125
S15 = 1 x {1 - |120-120|/40} + 0 x 1 + 0 x 1 + 1 x 0 + 0 x 1 = 0.500

S23 = 1 x {1 - |150-110|/40} + 1 x 0 + 0 x 1 + 1 x 0 + 1 x 0 = 0.000
S24 = 1 x {1 - |150-145|/40} + 1 x 0 + 1 x 0 + 1 x 0 + 1 x 0 = 0.175
S25 = 1 x {1 - |150-120|/40} + 1 x 0 + 1 x 0 + 1 x 0 + 1 x 0 = 0.050

S31 = 1 x {1 - |110-120|/40} + 1 x 0 + 1 x 0 + 1 x 0 + 1 x 0 = 0.150
S32 = 1 x {1 - |110-150|/40} + 1 x 0 + 0 x 1 + 1 x 0 + 1 x 0 = 0.000
S34 = 1 x {1 - |110-145|/40} + 1 x 0 + 1 x 0 + 0 x 1 + 0 x 1 = 0.042
S35 = 1 x {1 - |110-120|/40} + 1 x 0 + 1 x 0 + 0 x 1 + 1 x 0 = 0.187

S41 = 1 x {1 - |145-120|/40} + 0 x 1 + 0 x 1 + 1 x 0 + 1 x 0 = 0.125
S42 = 1 x {1 - |145-150|/40} + 1 x 0 + 1 x 0 + 1 x 0 + 1 x 0 = 0.175
S43 = 1 x {1 - |145-110|/40} + 1 x 0 + 1 x 0 + 0 x 1 + 0 x 1 = 0.042
S45 = 1 x {1 - |145-120|/40} + 0 x 1 + 0 x 1 + 0 x 1 + 1 x 0 = 0.187

S51 = 1 x {1 - |120-120|/40} + 0 x 1 + 0 x 1 + 1 x 0 + 0 x 1 = 0.500
S52 = 1 x {1 - |120-150|/40} + 1 x 0 + 1 x 0 + 1 x 0 + 1 x 0 = 0.050
S53 = 1 x {1 - |120-110|/40} + 1 x 0 + 1 x 0 + 0 x 1 + 1 x 0 = 0.187
S54 = 1 x {1 - |120-145|/40} + 0 x 1 + 0 x 1 + 0 x 1 + 1 x 0 = 0.187
The resulting similarity matrix is:

      1       2       3      4      5
1     1       0.0625  0.150  0.125  0.500
2     0.0625  1       0      0.175  0.050
3     0.150   0       1      0.042  0.187
4     0.125   0.175   0.042  1      0.187
5     0.500   0.050   0.187  0.187  1
Dissimilarity and Distance Measures

Dissimilarity is the complement of a similarity
measure. Some dissimilarity coefficients have
the metric property that

d_ij + d_ik >= d_jk for all i, j and k,

in which case they are generally known as distance
measures.

Distance measures are the most commonly used
measures of similarity between objects.
Distance Measures

The most commonly used measure of similarity (or
dissimilarity!) is the Euclidean Distance (hereafter
referred to as ED).

[Figure: two objects plotted in the X-Y plane, Object 1 at (X1, Y1) and
Object 2 at (X2, Y2), with the Euclidean distance between them given by]

Distance = sqrt[ (X2 - X1)^2 + (Y2 - Y1)^2 ]
Dissimilarity and Distance Measures

Perhaps the most familiar and most commonly used distance measure
is the Euclidean Distance (hereafter termed
ED). Used on raw data, however, it may be very
unsatisfactory, since its value is largely dependent on
the particular scale selected for the variables.
Example
           Weight (lbs)  Height (ft (in))
Child 1    60            3.0 (36)
Child 2    65            3.5 (42)
Child 3    63            4.0 (48)
Dissimilarity and Distance Measures

With height in feet, the EDs are: d12 = 5.02, d13 = 3.16 and d23 = 2.06.

Height in inches!
If, however, height had been measured in inches, the
EDs become d12 = 7.81, d13 = 12.37 and d23 = 6.32. Please
refer to the figure below.
Child 1 is now deemed to be closer to child 2 than to child
3 [the opposite of the earlier situation, when
height was expressed in feet!].

[Figure: the three children plotted with height in inches;
d12 = 7.81, d23 = 6.32, d13 = 12.37]
Dissimilarity and Distance Measures

This is the problem. What to do?
Standardization: z_ik = x_ik / s_k, where s_k is the standard
deviation of the kth variable.
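A short sketch, assuming the children's data above, that reproduces the scale sensitivity of the raw Euclidean distances and shows how dividing by the standard deviation removes the dependence on the measurement unit.

import numpy as np

# Heights in feet vs inches for the three children in the example above.
feet   = np.array([[60, 3.0], [65, 3.5], [63, 4.0]])    # (weight lbs, height ft)
inches = np.array([[60, 36.0], [65, 42.0], [63, 48.0]]) # (weight lbs, height in)

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

for data, unit in ((feet, "feet"), (inches, "inches")):
    d12 = euclidean(data[0], data[1])
    d13 = euclidean(data[0], data[2])
    d23 = euclidean(data[1], data[2])
    print(unit, round(d12, 2), round(d13, 2), round(d23, 2))
# feet:    5.02  3.16  2.06  -> child 1 closer to child 3
# inches:  7.81 12.37  6.32  -> child 1 closer to child 2

# Standardization z_ik = x_ik / s_k removes the dependence on the unit chosen:
z = inches / inches.std(axis=0, ddof=1)
print(np.round([euclidean(z[0], z[1]), euclidean(z[0], z[2]), euclidean(z[1], z[2])], 2))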
Distance Measures
Other options include:

1. Sum of squared differences.
2. Replace squared differences by the sum of the
absolute differences of the coordinates.

The second approach is known as the city-block (Manhattan) approach, but it
has several problems.
Distance Measures

Mahalanobis Distance

The Mahalanobis approach not only performs a
standardization of the data by scaling in terms
of standard deviations, but also sums the pooled
within-group variance-covariance, which adjusts for
intercorrelations among the variables.

In short, the Mahalanobis generalised distance procedure
computes a distance measure between objects that is
comparable to R^2 (the coefficient of determination) in
regression.
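As an illustration only, the following sketch computes a Mahalanobis distance from a sample variance-covariance matrix, used here as a stand-in for the pooled within-group matrix mentioned above; the data are hypothetical.

import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between two observations given a covariance matrix."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Illustrative data: two correlated variables measured on a small sample.
X = np.array([[60, 36.0], [65, 42.0], [63, 48.0], [61, 40.0], [64, 45.0]])
cov = np.cov(X, rowvar=False)   # sample variance-covariance matrix of the two variables

print(round(mahalanobis(X[0], X[1], cov), 3))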
Between Group Similarity
and Distance Measures
In clustering applications, it is also necessary to be
able to define such measures between groups. Two
of the three problems which arise in finding a
suitable definition are:

1. The choice of a summary for each variable to
describe the group. Sensible choices might be
proportions for qualitative variables and means for
quantitative variables.
Between Group Similarity
and Distance Measures

2. Measurement of within-group variation.

3. Construction of a measure of similarity or distance
based on (1), making allowance for (2).

Making allowance for within-group variation can be
particularly tricky if it is not constant from one
group to another, and there is no reason to believe
that it should be.
Distance Measures

Quantitative Variables
A distance measure that geneticists have used
when describing groups, or populations, in terms of
gene frequencies is the so-called genetic distance d_AB,
defined as

d_AB = sqrt(1 - cos θ), where cos θ = Σ_i sqrt(p_iA p_iB).

The terms p_iA and p_iB are the gene frequencies
for the ith allele at a given locus in the two
populations.
The angular transformation of the proportions has a
variance-stabilizing role. When several genetic loci are
considered, the d_AB values are added together.

This approach to measuring distances between groups
can be generalized to qualitative variables merely by
replacing the word locus by variable and allele by
variable category.

If there are p qualitative variables, there will be p d_AB
values to add together. As an example, consider the
set of proportions for two hypothetical populations of
red campion.
Example:
Set of proportions for two populations of red campion

Character state           Proportions     Proportions
                          Population A    Population B
Corolla colour
  pink                    0.95            0.80
  white                   0.05            0.20
Coronal scale colour
  as petals, pink         0.85            0.75
  not as petals, pink     0.01            0.15
  not as petals, white    0.14            0.10
Red calyx pigment
  present                 0.80            0.60
  absent                  0.20            0.40
For corolla colour:
d_AB = {1 - (0.95 x 0.80)^1/2 - (0.05 x 0.20)^1/2}^1/2 = 0.17

For coronal scale colour:
d_AB = {1 - (0.85 x 0.75)^1/2 - (0.01 x 0.15)^1/2 - (0.14 x 0.10)^1/2}^1/2 = 0.21

For red calyx pigment:
d_AB = {1 - (0.80 x 0.60)^1/2 - (0.20 x 0.40)^1/2}^1/2 = 0.16

The total distance between the two groups is taken as
the sum:
d_AB = 0.17 + 0.21 + 0.16 = 0.54
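The red campion figures can be checked with a few lines of Python; the function name is illustrative, and the per-character distances are rounded as on the slide.

import numpy as np

def genetic_distance(pA, pB):
    """d_AB = sqrt(1 - sum_i sqrt(p_iA * p_iB)) for one locus / variable."""
    return float(np.sqrt(1.0 - np.sum(np.sqrt(np.asarray(pA) * np.asarray(pB)))))

corolla = genetic_distance([0.95, 0.05], [0.80, 0.20])
coronal = genetic_distance([0.85, 0.01, 0.14], [0.75, 0.15, 0.10])
calyx   = genetic_distance([0.80, 0.20], [0.60, 0.40])

print(round(corolla, 2), round(coronal, 2), round(calyx, 2))   # 0.17 0.21 0.16
print(round(corolla + coronal + calyx, 2))   # about 0.53-0.54; adding the rounded values gives 0.54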
How to Select a Distance Measure?
Use several measures and compare the results to a
theoretical or known pattern. Also, when the
variables have different units, one should
standardize before running the cluster analysis.
Standardization is particularly advisable when the
range of one variable is much larger than that of the others.

Finally, when the variables are inter-correlated,
the Mahalanobis distance measure is the most
appropriate.
Partitioning or Non
Hierarchical Methods
Cluster Analysis: Objective

Group data points into clusters so that the degree of
association is strong between members of the same
cluster (low variance within) and weak between
members of different clusters (high variance
between).
Cluster Analysis: Non Hierarchical
Methods
Clustering refers to identifying the number of
subclasses in a data universe X comprised of n data
samples, and partitioning X into c clusters (2 <= c < n).
c = 1 corresponds to rejection of the hypothesis that there are clusters
in the data, whereas c = n constitutes the trivial case
where each sample is in a cluster by itself.
There are two kinds of c-partitions: 1. Hard (or crisp)
2. Soft (or fuzzy).
Issues of Importance in Cluster Analysis

Issue 1
How to measure (mathematical) similarity
between pairs of observations?
A simple method could be the
distance between pairs of feature vectors in the
feature space. We then expect that the distance
between points in the same cluster will be
considerably less than the distance between points
in different clusters.
Issues of Importance in Cluster Analysis

Issue 2
How to evaluate the partitions (clusters) once they
are formed? This could be termed cluster
validity.
In this case, it is necessary to identify the value of
c that gives the most plausible number of
clusters in the data for the analysis at hand.
Nonhierarchical Methods

A single-pass method is one in which the partition is
created by a single pass through the data set or, if
randomly accessed, in which each compound is
examined only once to decide which cluster it should
be assigned to.

A relocation method is one in which compounds are
moved from one cluster to another to try to
improve on the initial estimation of the clusters.
Non-hierarchical Methods
The relocation is typically accomplished by
improving a cost function describing the goodness of
each resultant cluster.

The nearest-neighbour approach is more compound
centred than the other non-hierarchical methods.
In it, the environment around each compound is
examined in terms of its most similar neighbouring
compounds, with commonality between nearest
neighbours being used as a criterion for cluster
formation.
Clustering Method Proposed by
Professor James Bezdek
We define optimum partitions through a global criterion
function that measures the extent to which candidate
partitions optimize a weighted sum of squared errors
between data points and cluster centers in feature space.

Many other clustering algorithms are available.
However, the method of clustering must be closely
matched with the particular data under study for
successful interpretation of the substructure in the data.
C - Means Clustering

Bezdek (1981) developed an extremely powerful
classification method to accommodate fuzzy data. It
is an extension of the c-means, or hard c-means,
method employed in crisp classification.

For n data samples:

X = {x1, x2, x3, ..., xn}                                  (1)

Each data sample xi is defined by m features, i.e.,

xi = {xi1, xi2, xi3, ..., xim}                             (2)

where each xi in X is an m-dimensional vector of m
elements or m features. Normalize the m feature elements
before classification (if need be), as these could have
different units. In a geometric sense, each xi is a
point in m-dimensional feature space, and the universe
of the data samples, X, is a point set with n
elements in the sample space.

Bezdek [1981] suggested using an objective
function approach for clustering the data into
hyper-spherical clusters. The idea for hard
clustering is as follows:
each cluster of data has a hyper-spherical shape with
a hypothetical geometric cluster center.
[Figure: two hyper-spherical clusters in three-dimensional (x, y, z) feature
space with cluster centers v1 and v2]

The objective function is developed for two
purposes simultaneously: first, minimize the
Euclidean Distance (ED) between each data point
and its cluster center, and second, maximize
the ED between cluster centers.
Hard c - Means (HCM)
HCM is used to classify data in a crisp sense
(each data point will be assigned to one, and only one,
data cluster). In this sense, clusters are also called
partitions - that is, partitions of the data.

Define a family of sets {Ai, i = 1, 2, ..., c} as a hard c-
partition of X, where the following set-theoretic
forms apply to these partitions:

U_{i=1..c} Ai = X                                          (3)

Ai ∩ Aj = ∅ for all i ≠ j                                  (4)

∅ ⊂ Ai ⊂ X for all i                                       (5)

where X = {x1, x2, x3, ..., xn} is a finite set space of the
universe of data samples, and c is the number of
classes, or partitions, or clusters, into which we want
to classify the data.
Suppose we have the case where c = 2; the following
are the set expressions:

A2 = A1', A1 ∪ A1' = X and A1 ∩ A1' = ∅

These set expressions are equivalent to the
excluded middle laws.

The function-theoretic expressions associated
with Eqs. (3), (4) and (5) are:

Σ_{i=1..c} χ_Ai(xk) = 1 for all k                          (7)

χ_Ai(xk) ∧ χ_Aj(xk) = 0 for all k, i ≠ j                   (8)

0 < Σ_{k=1..n} χ_Ai(xk) < n for all i                      (9)

where the characteristic function χ_Ai(xk) is defined once again
as

χ_Ai(xk) = 1 if xk ∈ Ai, and 0 if xk ∉ Ai                  (10)

Explanation:
Any sample xk can only and definitely belong to one
of the c classes (Eqs. (7) and (8)), while Eq. (9) implies
that no class is empty and no class is the whole set
X (that is, the universe).

The membership of the jth data point in the ith cluster, or class, is defined
to be χ_ij = χ_Ai(xj). We now define a matrix U comprising
the elements χ_ij (i = 1, 2, ..., c; j = 1, 2, ..., n), that is, c rows and
n columns. We define a hard c-partition space for X as
the following matrix set:

Mc = { U | χ_ij ∈ {0, 1}, Σ_{i=1..c} χ_ik = 1, 0 < Σ_{k=1..n} χ_ik < n }    (11)

The cardinality of any hard c-partition space Mc is

η(Mc) = (1/c!) Σ_{i=1..c} (c choose i) (-1)^(c-i) i^n                       (12)
Example
X = {x1, x2, x3, x4, x5}, that is n = 5, and suppose we
want to cluster these points into c = 2 clusters.

η(Mc) = (1/2!){ 2(-1)(1)^5 + 1(+1)(2)^5 } = (1/2)(-2 + 32) = 15

Some of the 15 possible hard 2-partitions are:

1 1 1 1 0    1 1 1 0 0    1 1 1 0 0    0 0 0 0 1
0 0 0 0 1    0 0 0 1 1    0 0 0 1 1    1 1 1 1 0

and so on.
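A quick numerical check of the cardinality formula in Eq. (12), which is the Stirling number of the second kind; the function name is illustrative.

from math import comb, factorial

def hard_partition_count(n, c):
    """Number of distinct hard c-partitions of n samples (Eq. 12)."""
    return sum((-1) ** (c - i) * comb(c, i) * i ** n for i in range(1, c + 1)) // factorial(c)

print(hard_partition_count(5, 2))    # 15, as in the example above
print(hard_partition_count(4, 2))    # 7, used later for the catalytic-converter data
print(hard_partition_count(25, 10))  # on the order of 10**18, as noted further below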
Notice that the following two matrices,

U = 1 1 1 1 0      and      U = 0 0 0 0 1     (rows c1, c2)
    0 0 0 0 1                   1 1 1 1 0

are not different clustering 2-partitions.
In fact, they are the same 2-partition irrespective
of an arbitrary row swap (relabeling the rows of the U
matrix as c2 and c1). The cardinality measure gives
the number of unique c-partitions for n data points.
An interesting question now arises: of all the
possible c-partitions for n data samples, how can we
select the most reasonable c-partition from the
partition space Mc?
Which of the 15 possible hard 2-partitions for 5 data
points and 2 classes is the best? The answer: use an
objective function (or classification criterion)
to classify or cluster the data.
The hard c-means algorithm is a within-class
sum of squared errors approach using a Euclidean norm
to characterize distance, denoted J(U, v), where U is
the partition matrix and the parameter v is a vector
of cluster centers.
Objective Function
J(U, v) = Σ_{k=1..n} Σ_{i=1..c} χ_ik (d_ik)^2                                (13)

where d_ik is the ED measure (in the m-dimensional feature
space R^m) between the kth data sample xk and the ith
cluster center vi, given by

d_ik = d(xk, vi) = ||xk - vi|| = [ Σ_{j=1..m} (x_kj - v_ij)^2 ]^(1/2)        (14)
Since each data sample requires m coordinates to
describe its location in R^m-space, each cluster center
also requires m coordinates to describe its location in
this same space. Therefore, the ith cluster center is a vector
of length m,

vi = {v_i1, v_i2, ..., v_im}                                                 (15)

where the jth coordinate is calculated by

v_ij = ( Σ_{k=1..n} χ_ik x_kj ) / ( Σ_{k=1..n} χ_ik )                        (16)

We seek the optimum partition, U*, to be the
partition that produces the minimum value of the
function J. That is,

J(U*, v*) = min_{U ∈ Mc} J(U, v)                                             (17)

Finding U* by exhaustive search is exceedingly difficult, since the
cardinality of Mc grows very rapidly: for n = 25 and c = 10, the cardinality
of Mc is on the order of 10^18! Fortunately, a useful and
effective alternative search algorithm has been
devised [Bezdek, 1981].
Iterative Optimization [Bezdek, 1981]
Iterative optimization is basically like many other
iterative methods in that we start with an initial guess
at the U matrix.

The stepwise procedure is as follows:

1. Fix c (2 <= c < n) and initialize the U matrix:
   U(0) ∈ Mc. Then, for r = 1, 2, ...

2. Calculate the c center vectors:
   { vi(r) } with U(r).

3. Update U(r); calculate the updated
   characteristic function (for all i, k):

   χ_ik(r+1) = 1 if d_ik(r) = min{ d_jk(r) for all j in c },
               0 otherwise                                                   (18)

4. If || U(r+1) - U(r) || <= ε (tolerance level),                            (19)
   STOP; otherwise set r = r + 1 and return to step 2.

In step 4 the norm || · || is any matrix norm, such as the
Euclidean norm.
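The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions (for instance, that no cluster ever becomes empty), not a production implementation.

import numpy as np

def hard_c_means(X, c, U0, max_iter=100):
    """Iterative optimization for hard c-means (steps 1-4 above).
    X: (n, m) data matrix; c: number of clusters; U0: (c, n) initial crisp partition.
    Assumes no cluster ever becomes empty during the iterations."""
    X = np.asarray(X, dtype=float)
    U = np.asarray(U0, dtype=float)
    assert U.shape == (c, X.shape[0])
    for _ in range(max_iter):
        # Step 2: cluster centres v_i = sum_k chi_ik x_k / sum_k chi_ik   (Eq. 16)
        V = (U @ X) / U.sum(axis=1, keepdims=True)
        # Step 3: distances d_ik (Eq. 14) and the updated characteristic function (Eq. 18)
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)   # shape (c, n)
        U_new = np.zeros_like(U)
        U_new[np.argmin(d, axis=0), np.arange(X.shape[0])] = 1.0
        # Step 4: stop when the partition no longer changes (tolerance 0)
        if np.array_equal(U_new, U):
            break
        U = U_new
    return U, V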
Example: Performance of
Catalytic Converter
In a chemical engineering process involving an
automobile's catalytic converter (which converts
CO to CO2), we have a relationship between the
conversion efficiency of the catalytic converter
and the inverse of the temperature of the
catalyst. Two classes of data are known from
reaction efficiency: points of high conversion
efficiency and high temperature are indicators of
a non-polluting system (class c1), and points of low
conversion efficiency and low temperature are
indicative of a polluting system (class c2).
Suppose we measure the conversion efficiency
and temperature (T) of four different
catalytic converters and attempt to characterize
them as polluting or non-polluting.
The four data points (n = 4) are shown in the figure,
where the y-axis is conversion efficiency and the x-
axis is the inverse of the temperature [in a
conversion process like this the exact solution
takes the form ln(1/T)]. The data are described by
two features (m = 2) and have the following
coordinates in 2D space:
x1 = {1, 3}, x2 = {1.5, 3.2},
x3 = {1.3, 2.8}, x4 = {3, 1}
[Figure: the four data points in two-dimensional feature space, with the
initial partition U(0)]
To classify these data points into two classes (c = 2), it
is desirable first to compute the cardinality of the possible
number of crisp partitions for the system, i.e., to find
η(Mc):

η(Mc) = (1/2!){ (2 choose 1)(-1)^1 (1)^4 + (2 choose 2)(-1)^0 (2)^4 }
      = (1/2)[-2 + 16] = 7,

which says that there are seven
ways (irrespective of row swaps) to classify the
four points into two clusters. Let the initial guess of
the crisp partition U assign x1 to class 1
and x2, x3, x4 to class 2:

U(0) = 1 0 0 0
       0 1 1 1
We seek the optimal partition U* from among the seven
partitions. With a desired tolerance or convergence
level ε: U(0) → U(1) → U(2) → ... → U*.
For class 1 we calculate the coordinates of the
cluster center:

v_1j = (χ_11 x_1j + χ_12 x_2j + χ_13 x_3j + χ_14 x_4j) / (χ_11 + χ_12 + χ_13 + χ_14)
     = [ (1) x_1j + (0) x_2j + (0) x_3j + (0) x_4j ] / (1 + 0 + 0 + 0)

and vi = {v_i1, v_i2, ..., v_im}.
For m = 2, we deal with two coordinates
for each data point: vi = {v_i1, v_i2},
where for c = 1 (class 1), v1 = {v_11, v_12}; similarly, for c = 2
(class 2), v2 = {v_21, v_22}.
Therefore, using the expression for v_ij with i = 1 and j = 1, 2,
respectively:

v_11 = 1(1)/1 = 1    (x coordinate)
v_12 = 1(3)/1 = 3    (y coordinate), so v1 = {1, 3},

which just happens to be the coordinates of point x1,
since this is the only point in the class for the
assumed initial partition U(0).
For c = 2, we get the cluster center coordinates

v_2j = [ (0) x_1j + (1) x_2j + (1) x_3j + (1) x_4j ] / (0 + 1 + 1 + 1)
     = (x_2j + x_3j + x_4j) / 3

For i = 2 and j = 1, 2, respectively:

v_21 = [1(1.5) + 1(1.3) + 1(3)] / 3 = 1.93    (x coordinate)
v_22 = [1(3.2) + 1(2.8) + 1(1)] / 3 = 2.33    (y coordinate)
so v2 = {1.93, 2.33}.
We now compute the values of d_ik, the distance
from sample xk to the center vi of
the ith class, using Eq. (14):

d_ik = [ Σ_{j=1..m} (x_kj - v_ij)^2 ]^(1/2)

For c = 1: d_1k = [ (x_k1 - v_11)^2 + (x_k2 - v_12)^2 ]^(1/2).
Computing for each data sample k = 1 to 4:

d_11 = sqrt[ (1-1)^2 + (3-3)^2 ] = 0
d_12 = sqrt[ (1.5-1)^2 + (3.2-3)^2 ] = 0.54
d_13 = sqrt[ (1.3-1)^2 + (2.8-3)^2 ] = 0.36
d_14 = sqrt[ (3-1)^2 + (1-3)^2 ] = 2.83

and for cluster 2:

d_21 = sqrt[ (1-1.93)^2 + (3-2.33)^2 ] = 1.14
d_22 = sqrt[ (1.5-1.93)^2 + (3.2-2.33)^2 ] = 0.97
d_23 = sqrt[ (1.3-1.93)^2 + (2.8-2.33)^2 ] = 0.78
d_24 = sqrt[ (3-1.93)^2 + (1-2.33)^2 ] = 1.70
Update the partition to U(1) for each data point
using Eq. (18). Hence, for class 1 we compare
d_1k against min{d_1k, d_2k}.
For k = 1, 2, 3 and 4, respectively:
d_11 = 0.00, min(d_11, d_21) = min(0, 1.14) = 0.00, thus χ_11 = 1
d_12 = 0.54, min(d_12, d_22) = min(0.54, 0.97) = 0.54, thus χ_12 = 1
d_13 = 0.36, min(d_13, d_23) = min(0.36, 0.78) = 0.36, thus χ_13 = 1
d_14 = 2.83, min(d_14, d_24) = min(2.83, 1.70) = 1.70, thus χ_14 = 0
Therefore, the updated partition is

U(1) = 1 1 1 0
       0 0 0 1
Since the partitions U(0) and U(1) are different,
we repeat the same procedure based on the new assignment
of the two classes. For c = 1, the center coordinates are

v_1j = (x_1j + x_2j + x_3j) / (1 + 1 + 1 + 0), since χ_14 = 0
v_11 = (x_11 + x_21 + x_31)/3 = (1 + 1.5 + 1.3)/3 = 1.26
v_12 = (x_12 + x_22 + x_32)/3 = (3 + 3.2 + 2.8)/3 = 3.0
v1 = {1.26, 3.0}

and for c = 2, the center coordinates are

v_2j = x_4j / (0 + 0 + 0 + 1), since χ_21 = χ_22 = χ_23 = 0
v_21 = 3/1 = 3
v_22 = 1/1 = 1, so v2 = {3, 1}

Now, we calculate the distance measures again:

d_11 = sqrt[ (1-1.26)^2 + (3-3)^2 ] = 0.26
d_12 = sqrt[ (1.5-1.26)^2 + (3.2-3)^2 ] = 0.31
d_13 = sqrt[ (1.3-1.26)^2 + (2.8-3)^2 ] = 0.20
d_14 = sqrt[ (3-1.26)^2 + (1-3)^2 ] = 2.65
d_21 = sqrt[ (1-3)^2 + (3-1)^2 ] = 2.83
d_22 = sqrt[ (1.5-3)^2 + (3.2-1)^2 ] = 2.66
d_23 = sqrt[ (1.3-3)^2 + (2.8-1)^2 ] = 2.47
d_24 = sqrt[ (3-3)^2 + (1-1)^2 ] = 0.0
and again update the partition from U(1) to U(2):

d_11 = 0.26, min(d_11, d_21) = min(0.26, 2.83) = 0.26, thus χ_11 = 1
d_12 = 0.31, min(d_12, d_22) = min(0.31, 2.66) = 0.31, thus χ_12 = 1
d_13 = 0.20, min(d_13, d_23) = min(0.20, 2.47) = 0.20, thus χ_13 = 1
d_14 = 2.65, min(d_14, d_24) = min(2.65, 0) = 0.0, thus χ_14 = 0

Because the partitions U(1) and U(2) are identical, we
can say the iterative process has converged;
therefore, the optimum hard (crisp) partition is

U* = 1 1 1 0
     0 0 0 1
The optimum partition tells us that, for the
catalytic converter example, the data points x1, x2
and x3 are more indicative of a non-polluting
converter than is data point x4.
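Reusing the hard_c_means sketch given earlier (assuming that function is in scope), the catalytic-converter example can be reproduced as follows.

import numpy as np

# Four catalytic-converter samples in (1/T, conversion efficiency) feature space.
X = np.array([[1.0, 3.0], [1.5, 3.2], [1.3, 2.8], [3.0, 1.0]])
U0 = np.array([[1, 0, 0, 0],
               [0, 1, 1, 1]], dtype=float)     # the initial guess U(0)

U_star, V = hard_c_means(X, c=2, U0=U0)
print(U_star)           # expected: [[1 1 1 0], [0 0 0 1]], the optimum crisp partition U*
print(np.round(V, 2))   # cluster centres, approximately (1.27, 3.0) and (3.0, 1.0)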
[Figure: the final two clusters, U(final) - c1 containing x1, x2, x3 and c2
containing x4 - shown as the four data points in two-dimensional feature space]
HCM is also known as k-means clustering.
What Next?
A Primer on Classical Logic and Fuzzy Logic

Professor Ashok Deshpande
Fuzzy c-Means (FCM)

Why Fuzzy c-Means?
Three examples

Problem 1: Iterative Optimization
To which class does this point belong?
The Butterfly Classification Problem (Bezdek, 1981)
Butterfly Classification: Example 1
A good example of the iterative optimization method
is provided by the butterfly problem. We have 15
data points, and one of them lies on the vertical line of
symmetry (the point in the middle of the
data cluster). If c = 2 classes, and the points to the left and
right of the line of symmetry belong to different
classes, the problem lies in assigning the point on the
line of symmetry to a class.
To which class does this point belong? Whichever class
the algorithm assigns this point to, there will be a
good argument that it should be a member of the
other class.
Butterfly Classification Problem

Alternatively, the argument may revolve around the
fact that the choice of two classes is a poor one for
this problem. Three classes might be the best
choice, but the physics underlying the data might be
binary, and two classes may be the only option.
In conducting the iterative optimization approach
we have to assume an initial U matrix. This matrix
will have two rows (two classes, c = 2) and 15
columns (15 data points, n = 15). It is important to
understand that the classes may be unlabeled in this
process.
Butterfly Classification Problem

That is, we can look at the structure of the data
without the need to assign labels to
the classes. This is often the case when one is first
looking at a group of data. After several iterations
with the data, and as we become more and more
knowledgeable about the data, we can then assign
labels to the classes. We start the solution with the
assumption that the point in the middle is assigned
to the class represented by the bottom row of the
initial U matrix, U(0):

U(0) = 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
       0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
Butterfly Classification Problem

After four iterations (Bezdek, 1981) this method
converges to within a tolerance level of ε = 0.01 as

U(4) = 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
       0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

The point on the line of symmetry has full
membership in the second class and no membership in
the first class.
We can see from the shape that the eighth point
should share membership with each class. This is not
possible with crisp classification; membership is binary -
a point is either a member of a class or it is not.
Example 2
Looking at the picture, along the x-axis we may identify two
clusters in the proximity of the two data concentrations.
We will refer to them as A and B. In the first
approach, the k-means algorithm, we associated each
datum with a specific centroid; the membership of each
datum in a cluster was therefore either 0 or 1.
Fuzzy c-Means
In the FCM approach, instead, a given datum
does not belong exclusively to a single well-defined cluster;
it can be placed in a middle way. In this case, the
membership function follows a smoother curve to
indicate that every datum may belong to several
clusters with different values of the membership
coefficient.
Fuzzy c-Means

In the figure referred to above, the datum shown as a marked
spot belongs more to the B cluster than to
the A cluster: a membership value of 0.2 indicates its
degree of membership in A. Now,
instead of using a graphical representation, we
introduce a matrix U whose entries are the values
taken from the membership functions:

crisp U_{N x c}        fuzzy U_{N x c}
  c1  c2                 c1   c2
   1   0                0.8  0.2
   0   1                0.3  0.7
   1   0                0.6  0.4
   0   1                0.9  0.1
Example-3
Suppose you are a fruit geneticist interested in
genetic relationships among fruits. In particular, you
know that a tangelo is a cross between a grapefruit
and a tangerine. You describe the fruit with such
features as colour, weight, sphericity, sugar content,
skin texture and so on. Hence, your feature space
could be highly dimensional.

Suppose you have three fruits (three data points),
X = {x1 = grapefruit, x2 = tangelo, x3 = tangerine},
each described by m features.
Continued

Let us classify the three fruits into two classes (using
crisp classification) to decide the genetic
assignment of the three fruits. The cardinality for this
case, where n = 3 and c = 2, is η(Mc) = 3. Arrange the U matrix as
follows (x1 = grapefruit, x2 = tangelo, x3 = tangerine):

        x1  x2  x3
U = c1   1   0   0
    c2   0   1   1

The three possible 2-partition matrices are:

c1   1 0 0     1 1 0     1 0 1
c2   0 1 1     0 0 1     0 1 0
The first crisp partition is an uncomfortable segregation:
the grapefruit is in one class and the tangelo and tangerine
in the other, indicating that the two share nothing in
common with the grapefruit!
The second partition puts the grapefruit and the tangelo in one
class, suggesting that they share nothing in common with the
tangerine!
Finally, the third partition is the most genetically
discomforting of all, because here the tangelo is in a
class by itself, sharing nothing in common with its
progenitors! One of the three partitions would have to be the final
answer. Which one is the best? The answer is NONE.
What to do? Is this a fuzzy case? YES.
In the fuzzy case, this segregation and genetic absurdity is
not a problem. We can have the most intuitive situation,
where the tangelo shares membership with both classes,
i.e. with both parents. The following partition might be a
typical outcome for the fruit genetics problem:
     x1 (grapefruit)  x2 (tangelo)  x3 (tangerine)
U = c1    0.91            0.58          0.13
    c2    0.09            0.42          0.87

We can show that the sum of each row is a number between 0 and n:
0 < Σ_k μ_1k = 1.62 < 3 and 0 < Σ_k μ_2k = 1.38 < 3.
Moreover, μ_11 ∧ μ_21 = min(0.91, 0.09) = 0.09 ≠ 0,
μ_12 ∧ μ_22 = min(0.58, 0.42) = 0.42 ≠ 0, and μ_13 ∧ μ_23 = 0.13 ≠ 0.
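A small numeric check of these soft-partition properties (illustrative only).

import numpy as np

# Columns sum to 1; each row sum lies strictly between 0 and n; pairwise minima are non-zero.
U = np.array([[0.91, 0.58, 0.13],
              [0.09, 0.42, 0.87]])

print(U.sum(axis=0))            # [1. 1. 1.]  -> each fruit's memberships sum to 1
print(U.sum(axis=1))            # [1.62 1.38] -> 0 < row sum < n = 3
print(np.minimum(U[0], U[1]))   # [0.09 0.42 0.13] -> all non-zero, unlike a crisp partition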
How??
Fuzzy c-Means
Algorithm
Fuzzy c-Means

The Fuzzy c-Means (FCM) algorithm generalizes
the hard c-means algorithm to allow a point to
belong partially to multiple clusters. Therefore, it
produces a soft partition for a given data set. The
objective function J1 of hard c-means is extended in
two ways.

First, fuzzy membership degrees in clusters
are incorporated into the formulation,
Fuzzy c-Means

and second, an additional parameter m is
introduced as a weight exponent in the fuzzy
membership.

The extended objective function, denoted Jm, is

Jm(P, V) = Σ_{i=1..k} Σ_{xk ∈ X} ( μ_Ci(xk) )^m || xk - vi ||^2              [1]

where P is a fuzzy partition of the dataset X formed
by C1, C2, ..., Ck. The parameter m is a weight that
determines the degree to which partial members of a
cluster affect the clustering result.
Fuzzy c-Means

The algorithm tries to find a good partition by
searching for prototypes vi that minimize the
objective function Jm. The fuzzy c-means algorithm also needs to
search for membership functions μ_Ci that minimize
Jm.
To accomplish these two objectives, a necessary
condition for a local minimum of Jm was derived from
Jm. This condition, formally stated below, serves
as the foundation of the fuzzy c-means algorithm.
Fuzzy c -Means Theorem (A
Summary)
A constrained fuzzy partition {C1, C2, ..., Ck} can be
a local minimum of the objective function Jm only
if the following conditions are satisfied:

μ_Ci(x) = 1 / Σ_{j=1..k} ( || x - vi ||^2 / || x - vj ||^2 )^(1/(m-1)),
          1 <= i <= k, x ∈ X                                                 [2]

vi = ( Σ_{x ∈ X} [ μ_Ci(x) ]^m x ) / ( Σ_{x ∈ X} [ μ_Ci(x) ]^m ),
          1 <= i <= k                                                        [3]
Based on this theorem, FCM updates the
prototypes and the membership functions iteratively, using equations [2]
and [3], until a convergence criterion is reached.
The Fuzzy c-means (FCM) Algorithm

FCM(X, c, m, ε)
X: an unlabeled data set; c: the number of clusters to
form; m: the parameter (weight exponent) in the objective
function; ε: the threshold for the convergence
criterion.

Initialize the prototypes V = {v1, v2, ..., vc}
Repeat:
    V_previous ← V
    Compute the membership functions using equation [2]
    Update the prototypes vi in V using equation [3]
until Σ_{i=1..c} || vi - vi_previous || <= ε
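A minimal Python sketch of the algorithm using equations [2] and [3]. The function name, the random initialization fallback and the L1 convergence test are illustrative choices, and the demo run uses the six-point data set of the worked problem that follows.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=0.01, V0=None, max_iter=100):
    """Fuzzy c-means following equations [2] and [3] above.
    X: (n, d) data; c: number of clusters; m: weight exponent (> 1);
    eps: threshold on the total prototype movement."""
    X = np.asarray(X, dtype=float)
    if V0 is not None:
        V = np.asarray(V0, dtype=float)
    else:
        V = X[np.random.default_rng(0).choice(len(X), size=c, replace=False)]
    for _ in range(max_iter):
        # Equation [2]: membership of every point in every cluster
        d2 = np.maximum(((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2), 1e-12)  # (n, c)
        ratio = (d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0))              # (n, c, c)
        U = 1.0 / ratio.sum(axis=2)                                                 # (n, c)
        # Equation [3]: prototype update, weighted by membership ** m
        W = U.T ** m                                                                # (c, n)
        V_new = (W @ X) / W.sum(axis=1, keepdims=True)
        if np.abs(V_new - V).sum() <= eps:
            return U, V_new
        V = V_new
    return U, V

# Demo on the six-point data set of the worked problem below,
# with the same initial prototypes v1 = (5, 5) and v2 = (10, 10).
X = np.array([[2, 12], [4, 9], [7, 13], [11, 5], [12, 7], [14, 4]], dtype=float)
U, V = fuzzy_c_means(X, c=2, m=2, V0=[[5, 5], [10, 10]])
print(np.round(U, 4))   # memberships in the two clusters, one row per data point
print(np.round(V, 4))   # converged prototypes (iterated further than the single hand-worked step)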
Problem on Fuzzy c-Means

Suppose we are given a data set of six points, each of
which has two features, f1 and f2 (see the table below).
Assume that we want to use FCM to partition the data
into two clusters (c = 2), that we set the parameter m
to 2, and that the initial prototypes are
v1 = (5, 5) and v2 = (10, 10).

      f1   f2
x1     2   12
x2     4    9
x3     7   13
x4    11    5
x5    12    7
x6    14    4
[Figure: An Example of the Fuzzy c-Means Algorithm - the six data points
x1..x6 plotted in the (f1, f2) feature plane with the initial prototypes
v1 = (5, 5) and v2 = (10, 10)]
Problem on Fuzzy c-Means

The initial membership functions of the two clusters
are calculated using equation [2] (with m = 2):

μ_Ci(xk) = 1 / Σ_{j=1..2} ( || xk - vi ||^2 / || xk - vj ||^2 )

|| x1 - v1 ||^2 = |2-5|^2 + |12-5|^2 = 3^2 + 7^2 = 58

Similarly,
|| x1 - v2 ||^2 = 8^2 + 2^2 = 68
μ_C1(x1) = 1 / { 58/58 + 58/68 } = 0.5397
μ_C2(x1) = 1 / { 68/58 + 68/68 } = 0.4603
Problem on Fuzzy c-Means

Similarly, we obtain the following:

μ_C1(x2) = 1 / { 17/17 + 17/37 } = 0.6852
μ_C2(x2) = 1 / { 37/17 + 37/37 } = 0.3148

μ_C1(x3) = 1 / { 68/68 + 68/18 } = 0.2093
μ_C2(x3) = 1 / { 18/68 + 18/18 } = 0.7907

μ_C1(x4) = 1 / { 36/36 + 36/26 } = 0.4194
μ_C2(x4) = 1 / { 26/36 + 26/26 } = 0.5806

Problem on Fuzzy c-Means

μ_C1(x5) = 1 / { 53/53 + 53/13 } = 0.1970
μ_C2(x5) = 1 / { 13/53 + 13/13 } = 0.8030

μ_C1(x6) = 1 / { 82/82 + 82/52 } = 0.3881
μ_C2(x6) = 1 / { 52/82 + 52/52 } = 0.6119

Therefore, using these initial prototypes of the
clusters, the membership functions indicate that x1 and x2 are more in
the first cluster, while the remaining points in the
dataset are more in the second cluster.

The FCM algorithm then updates the prototypes
according to equation [3]:
Problem on Fuzzy c-Means
v1 = ( Σ_{k=1..6} [ μ_C1(xk) ]^2 xk ) / ( Σ_{k=1..6} [ μ_C1(xk) ]^2 )

   = [ (0.5397)^2 (2,12) + (0.6852)^2 (4,9) + (0.2093)^2 (7,13)
     + (0.4194)^2 (11,5) + (0.1970)^2 (12,7) + (0.3881)^2 (14,4) ]
     / [ (0.5397)^2 + (0.6852)^2 + (0.2093)^2 + (0.4194)^2 + (0.1970)^2 + (0.3881)^2 ]

   = { 7.2761/1.0979, 10.044/1.0979 }
   = (6.6273, 9.1484)
Problem on Fuzzy c-Means
v2 = ( Σ_{k=1..6} [ μ_C2(xk) ]^2 xk ) / ( Σ_{k=1..6} [ μ_C2(xk) ]^2 )

   = [ (0.4603)^2 (2,12) + (0.3148)^2 (4,9) + (0.7907)^2 (7,13)
     + (0.5806)^2 (11,5) + (0.8030)^2 (12,7) + (0.6119)^2 (14,4) ]
     / [ (0.4603)^2 + (0.3148)^2 + (0.7907)^2 + (0.5806)^2 + (0.8030)^2 + (0.6119)^2 ]

   = { 22.326/2.2928, 19.4629/2.2928 }
   = (9.7374, 8.4887)
[Figure: Example of the Fuzzy c-Means Algorithm - the six data points with the
updated prototypes v1 = (6.6273, 9.1484) and v2 = (9.7374, 8.4887) in the
(f1, f2) feature plane]
Fuzzy c-Means: Example

[Figure: the four fuzzy data points to be clustered using FCM, plotted in the plane]

Consider the following data points: x1 = (1, 5); x2 = (2, 4);
x3 = (1, 1.5) and x4 = (1.5, 0.5).
Use a weighting factor m = 2 and a convergence tolerance ε = 0.05.
To start, take the initial partition

U(0) = 1 0 1 0
       0 1 0 1
The cluster center coordinates for each class are
given by

v_ij = ( Σ_{k=1..n} μ_ik^m x_kj ) / ( Σ_{k=1..n} μ_ik^m ),  j = 1, 2, ..., m in m-dimensional space.

For c = 1:  v_11 = [1(1) + 1(1)] / 2 = 1
            v_12 = [5(1) + 1.5(1)] / 2 = 3.25
Therefore v1 = (1, 3.25).

For c = 2:  v_21 = [2(1) + 1.5(1)] / 2 = 1.75
            v_22 = [4(1) + 0.5(1)] / 2 = 2.25
Therefore v2 = (1.75, 2.25).

The distances of each data point from each cluster center
are computed using

d_ik = [ Σ_j (x_kj - v_ij)^2 ]^(1/2)

d_11 = sqrt[ (1-1)^2 + (5-3.25)^2 ] = 1.75
d_12 = sqrt[ (2-1)^2 + (4-3.25)^2 ] = 1.25
d_13 = sqrt[ (1-1)^2 + (1.5-3.25)^2 ] = 1.75
d_14 = sqrt[ (1.5-1)^2 + (0.5-3.25)^2 ] = 2.795

d_21 = sqrt[ (1-1.75)^2 + (5-2.25)^2 ] = 2.85
d_22 = sqrt[ (2-1.75)^2 + (4-2.25)^2 ] = 1.76
d_23 = sqrt[ (1-1.75)^2 + (1.5-2.25)^2 ] = 1.06
d_24 = sqrt[ (1.5-1.75)^2 + (0.5-2.25)^2 ] = 1.76
With these distance measures, we can update U using

μ_ik^(r+1) = [ Σ_{j=1..c} ( d_ik^(r) / d_jk^(r) )^(2/(m-1)) ]^(-1)

For i = 1 we get:

μ_11 = [ (d_11/d_11)^2 + (d_11/d_21)^2 ]^(-1) = [1 + (1.75/2.85)^2]^(-1) = 0.726
μ_12 = [ (d_12/d_12)^2 + (d_12/d_22)^2 ]^(-1) = [1 + (1.25/1.76)^2]^(-1) = 0.665
μ_13 = [ (d_13/d_13)^2 + (d_13/d_23)^2 ]^(-1) = [1 + (1.75/1.06)^2]^(-1) = 0.268
μ_14 = [ (d_14/d_14)^2 + (d_14/d_24)^2 ]^(-1) = [1 + (2.795/1.76)^2]^(-1) = 0.284

Since Σ_i μ_ik = 1 for each k,

U(1) = 0.726   0.665   0.268  0.284
       0.2738  0.3353  0.732  0.716

|| U(1) - U(0) || = 0.27 > ε = 0.05

For the next iteration, we proceed by calculating the
cluster centers again, but now from the new partition U(1),
first for c = 1 and then for c = 2.
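To carry this second example to convergence, one can reuse the fuzzy_c_means sketch given earlier (assuming it is in scope); the exact converged values are not asserted here.

import numpy as np

# The four data points of this example, starting from the first-iteration
# centres computed above.
X = np.array([[1, 5], [2, 4], [1, 1.5], [1.5, 0.5]])
U, V = fuzzy_c_means(X, c=2, m=2, eps=0.05, V0=[[1.0, 3.25], [1.75, 2.25]])
print(np.round(U, 3))   # memberships of x1..x4 in the two clusters
print(np.round(V, 3))   # cluster centres at convergence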
Fuzzy c-Means Clustering

A Commentary:
Limitations and Research Efforts
Limitations of FCM
Hard clustering partitions the dataset into clusters
such that each object belongs to exactly one
cluster. This procedure is unsuitable for real-world
datasets in which class memberships are not strict and
there are no distinct boundaries between the
clusters.

After fuzzy set theory was introduced by Lotfi
A. Zadeh, researchers incorporated the concept of
fuzziness into clustering techniques to deal with the
ambiguity of data. So in this work we focus on fuzzy
clustering among the unsupervised clustering techniques.
Limitations of FCM

In unsupervised clustering, there is no a priori
information on the given data distribution. The
objective of unsupervised fuzzy clustering is to
allocate each data point to all of the different clusters with
some degree of membership. The membership of a
data element thus shares the character of
different clusters, and it depends on the proximity of
the data element to the cluster centres.
Limitations of FCM

The fuzzy c-means (FCM) is the most significant algorithm
in fuzzy clustering; it was proposed by Dunn and
generalized by Bezdek, and is now effectively used
in a wide variety of real-world problems.

The FCM method still has some shortcomings: it
requires a large amount of time to converge, and
it is highly sensitive to noise and outliers in the
data because of the squared norm used to compute similarity
between cluster centers and data elements. To resolve
these problems, many researchers have developed
modified fuzzy c-means algorithms for data analysis, particularly
for image segmentation.
Variants of FCM

A modified entropy-based FCM method has
been developed by Cheng and Wei for data
clustering. A modified FCM which includes
spatial information in the membership function
for clustering was proposed by Chuang et al. to
decrease spurious blobs and remove
noisy spots in images. An adaptive weighted
averaging FCM (AWA-FCM), in which the spatial
influence of the neighbouring pixels on the central
pixel is incorporated, was developed by Jiayin
Kang et al.
Variants of FCM

Though the above algorithms provide good
outcomes in clustering large datasets and in image
segmentation, their performance deteriorates
rapidly when the noise level is amplified.
Since the Euclidean distance is used to calculate
dissimilarity in all of these modified FCMs, they fail to deal with
generally shaped datasets. In addition, their
computational cost is high for large datasets, and they
fail to handle heavy noise and outliers.
The limitations of recent fuzzy c-means algorithms
in data clustering and medical image segmentation
could possibly be taken care of by the method
proposed by Dr. S. R. Kannan (he calls me uncle).
Variants of FCM

Dr. Kannan, with his group, presents three new
effective fuzzy c-means methods based on hyper-
tangent distance and effective additional and penalty
terms, for clustering complex data structures and
breast medical images.
The objective functions of the proposed FCMs
are essentially developed to improve the robustness
of obtaining meaningful clusters, robustness to noise
and outliers, desirable memberships, and the
similarity measurement in real-world datasets.

The research is in progress at his end. I wish him
great success.
