
Hard C Mean and Fuzzy

C Mean Clustering
Professor Ashok Deshpande, PhD (Engineering)
Founding Chair: Berkeley Initiative in Soft Computing
(BISC) - Special Interest Group (SIG) - Environment
Management Systems (EMS)
Guest Faculty: University of California, Berkeley, CA, USA
Visiting Professor: University of New South Wales, Canberra,
Australia, and Indian Institute of Technology Mumbai
Adjunct Professor: NIT Silchar and College of Engineering Pune
Former Deputy Director: National Environmental Engineering
Research Institute (NEERI), India
So far as the laws of mathematics
refer to reality they are not
certain. And so far as they are
certain, they do not refer to
reality.
Albert Einstein
Theoretical Physicist and Nobel Laureate,
Geometrie und Erfahrung, Lecture to the Prussian Academy, 1921
Classification and Clustering

From causes which appear similar, we
expect similar effects. This is the sum
total of all our experimental conclusions.

David Hume
Scottish philosopher
An Enquiry Concerning Human Understanding, 1748
A Commentary on Clustering
and Classification

There is a structure in nature. Much of this structure
is known to us and is quite beautiful.

Natural sphericity of rain drops and bubbles: why do
balloons take this shape?

Some patterns, such as the helical appearance of DNA or
the cylindrical shape of some microbes.

Consider the geometry and colourful patterns of a
butterfly's wings: why do these patterns exist in our
physical world?
A Commentary on Clustering
and Classification

Answers to some of these questions are still
unknown; many others have been discovered through
increased understanding in physics, chemistry and
biology.

Just as there is structure in nature, we believe
there is an underlying structure in most of the
phenomena we wish to understand.

Examples:
Image recognition
Molecular biology applications such as protein folding
and 3D molecular structure
Oil exploration
Cancer detection, and so on.
A Commentary on Clustering
and Classification

For fields dealing with diagnosis, we often seek to
find structure in the data obtained from
observations (visual, audio, perception based, etc.).
Finding the structure in the data is the essence
of classification.
By finding structures we are classifying the data
according to similar patterns, attributes, features
and other characteristics.
In classification, also termed clustering, the
most important issue is deciding the criteria
against which to classify.
What is Cluster Analysis?
Cluster analysis is the name of a group of multivariate
techniques whose primary purpose is to identify similar
entities from the characteristics they possess.

Cluster analysis is also referred to as:

Q-analysis
Typology
Classification analysis
Numerical taxonomy
Cluster Analysis: Objective

Group data points into clusters so that the degree of
association is strong between members of the same
cluster (low variance within) and weak between
members of different clusters (high variance
between).
Where Do We Use Cluster Analysis ?

Sequence Alignment
Phylogenetic Tree Building
Ecological Data Analysis
Grouping patients into disease areas
Physiology and Anatomy
Taxonomy
Chemistry and so on.
Basic Methods
Clustering Formalism
-Three Types
1. Based on classical methods:

Hierarchical (HR) and non-hierarchical (NR) methods such as:
nearest/farthest neighbourhood (single linkage or
complete linkage), Hard c-Means (or k-means
clustering)

2. Fuzzy logic based methods:

Fuzzy equivalence relations
Fuzzy c-Means
Clustering Formalism
-Three Types

3. Based on bioinformatics concepts:

UPGMA, Transformed Distance (TD), Neighbour
Relations (NR), Neighbour Joining (NJ),
Maximum Parsimony (MP), Maximum Likelihood
Estimation (MLE), Hidden Markov Models (HMM),
Markov Chain Monte Carlo (MCMC) and many more
Similarity Measures

Similarity measures or relations, and distance
measures, are two important facets of any
classification or cluster analysis. Some of the other
aspects include feature selection, the number of classes/
patterns, and the like.

We will discuss one of the methods of similarity
relations before HCM and FCM.
Partitioning

How should inter-object similarity be measured?

Similarity measures:

- Closeness or proximity between each pair of objects.
- Distance or the difference between pairs of objects.

Since distance is the complement of similarity, this
approach can be used to assess similarity.
Similarity Measures :Variables
of Mixed Type

Gower's similarity measure, or Gower's coefficient:

s_ij = ( Σ_{k=1..p} w_ijk s_ijk ) / ( Σ_{k=1..p} w_ijk ),

where s_ijk is the similarity between the ith and jth
individuals as measured by the kth variable, and w_ijk
is typically 1 or 0 depending on whether or not the
comparison is considered valid for the kth variable.
A weight of zero is assigned when variable k is
unknown for one or both individuals, or to binary
variables where it is required to exclude negative
matches.
Similarity Measures
For variables of Mixed Type

For categorical variables the component similarities
s_ijk take the value one when the two individuals
have the same value and zero otherwise. For
quantitative variables they are given by

s_ijk = 1 - |x_ik - x_jk| / R_k,

where x_ik and x_jk are the two individuals' values for
variable k, and R_k is the range of variable k, usually in
the set of individuals to be considered.
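To make the mechanics concrete, here is a minimal Python sketch of Gower's coefficient under the conventions just described. The function name, the encoding of the variables and the treatment of the ordinal anxiety level as a plain categorical match are illustrative assumptions, not part of the original slides.

import numpy as np

def gower_similarity(xi, xj, kinds, ranges, exclude_negative=()):
    """Gower's coefficient for two individuals described by mixed variables.
    xi, xj : feature values; kinds : 'quant' or 'cat' per variable;
    ranges : range R_k for quantitative variables (None otherwise);
    exclude_negative : indices of binary variables whose negative
    (absent/absent) matches get weight zero."""
    num, den = 0.0, 0.0
    for k, (a, b) in enumerate(zip(xi, xj)):
        if k in exclude_negative and a == b == "No":   # negative match -> w_ijk = 0
            continue
        if kinds[k] == "quant":
            s = 1.0 - abs(a - b) / ranges[k]
        else:                                          # categorical: 1 if equal, else 0
            s = 1.0 if a == b else 0.0
        num += s                                       # w_ijk = 1 for every valid comparison
        den += 1.0
    return num / den if den else 0.0

# Hypothetical encoding of the five psychiatric patients listed below
# (weight, anxiety, depression, hallucination, age group); weight range R = 40 lbs.
patients = [
    (120, "Mild",     "No",  "No",  "Young"),
    (150, "Moderate", "Yes", "No",  "Middle"),
    (110, "Severe",   "Yes", "Yes", "Old"),
    (145, "Mild",     "No",  "Yes", "Old"),
    (120, "Mild",     "No",  "Yes", "Young"),
]
kinds  = ["quant", "cat", "cat", "cat", "cat"]
ranges = [40, None, None, None, None]

s12 = gower_similarity(patients[0], patients[1], kinds, ranges, exclude_negative=(2, 3))
print(round(s12, 4))   # 0.0625, matching the hand calculation for patients 1 and 2 below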
Data: Five Psychiatrically Ill Patients

           Weight (lbs)  Anxiety level  Depression present?  Hallucination present?  Age group
Patient 1  120           Mild           No                   No                      Young
Patient 2  150           Moderate       Yes                  No                      Middle
Patient 3  110           Severe         Yes                  Yes                     Old
Patient 4  145           Mild           No                   Yes                     Old
Patient 5  120           Mild           No                   Yes                     Young
The investigator wishes to exclude negative matches on
depression and hallucinations from the calculation of
between patient similarity.
Gower's similarity measure for each pair of patients:
S12 = {1 x (1 - 30/40) + 1 x 0 + 1 x 0 + 0 x 1 + 1 x 0} / [1 + 1 + 1 + 0 + 1] = 0.0625
S23 = {1 x (1 - 40/40) + 1 x 0 + 0 x 1 + 1 x 0 + 1 x 0} / [1 + 1 + 0 + 1 + 1] = 0

S13 = 1 x {1 - |120-110|/40} + 1 x 0 + 1 x 0 + 1 x 0 + 1 x 0 = 0.150
S14 = 1 x {1 - |120-145|/40} + 0 x 1 + 0 x 1 + 1 x 0 + 1 x 0 = 0.125
S15 = 1 x {1 - |120-120|/40} + 0 x 1 + 0 x 1 + 1 x 0 + 0 x 1 = 0.500

S23 = 1 x {1 - |150-110|/40} + 1 x 0 + 0 x 1 + 1 x 0 + 1 x 0 = 0.000
S24 = 1 x {1 - |150-145|/40} + 1 x 0 + 1 x 0 + 1 x 0 + 1 x 0 = 0.175
S25 = 1 x {1 - |150-120|/40} + 1 x 0 + 1 x 0 + 1 x 0 + 1 x 0 = 0.050

S31 = 1 x {1 - |110-120|/40} + 1 x 0 + 1 x 0 + 1 x 0 + 1 x 0 = 0.150
S32 = 1 x {1 - |110-150|/40} + 1 x 0 + 0 x 1 + 1 x 0 + 1 x 0 = 0.000
S34 = 1 x {1 - |110-145|/40} + 1 x 0 + 1 x 0 + 0 x 1 + 0 x 1 = 0.042
S35 = 1 x {1 - |110-120|/40} + 1 x 0 + 1 x 0 + 0 x 1 + 1 x 0 = 0.187

S41 = 1 x {1 - |145-120|/40} + 0 x 1 + 0 x 1 + 1 x 0 + 1 x 0 = 0.125
S42 = 1 x {1 - |145-150|/40} + 1 x 0 + 1 x 0 + 1 x 0 + 1 x 0 = 0.175
S43 = 1 x {1 - |145-110|/40} + 1 x 0 + 1 x 0 + 0 x 1 + 0 x 1 = 0.042
S45 = 1 x {1 - |145-120|/40} + 0 x 1 + 0 x 1 + 0 x 1 + 1 x 0 = 0.187

S51 = 1 x {1 - |120-120|/40} + 0 x 1 + 0 x 1 + 1 x 0 + 0 x 1 = 0.500
S52 = 1 x {1 - |120-150|/40} + 1 x 0 + 1 x 0 + 1 x 0 + 1 x 0 = 0.050
S53 = 1 x {1 - |120-110|/40} + 1 x 0 + 1 x 0 + 0 x 1 + 1 x 0 = 0.187
S54 = 1 x {1 - |120-145|/40} + 0 x 1 + 0 x 1 + 0 x 1 + 1 x 0 = 0.187
The resulting similarity matrix is:

      1       2       3      4      5
1     1       0.0625  0.150  0.125  0.500
2     0.0625  1       0      0.175  0.050
3     0.150   0       1      0.042  0.187
4     0.125   0.175   0.042  1      0.187
5     0.500   0.050   0.187  0.187  1
Dissimilarity and Distance Measures

Dissimilarity is the complement of a similarity
measure. Some dissimilarity coefficients have
the metric property that

d_ij + d_ik >= d_jk for all i, j and k,

in which case they are generally known as distance
measures.

Distance measures are the most commonly used
measures of similarity between objects.
Distance Measures

The most commonly used measure of similarity (or
dissimilarity!) is the Euclidean Distance (hereafter
referred to as ED).

[Figure: two objects plotted in the X-Y plane, Object 1 at (X1, Y1) and
Object 2 at (X2, Y2), with the Euclidean distance between them given by]

Distance = sqrt[ (X2 - X1)^2 + (Y2 - Y1)^2 ]
Dissimilarity and Distance Measures

Perhaps the most familiar and most commonly used distance measure
is the Euclidean Distance (hereafter termed
ED). Used on raw data, however, it may be very
unsatisfactory, since its value is largely dependent on
the particular scale selected for the variables.
Example
           Weight (lbs)  Height (ft (in))
Child 1    60            3.0 (36)
Child 2    65            3.5 (42)
Child 3    63            4.0 (48)
Dissimilarity and Distance Measures

With height in feet, the EDs are: d12 = 5.02, d13 = 3.16 and d23 = 2.06.

Height in inches!
If, however, height had been measured in inches, the
EDs become d12 = 7.81, d13 = 12.37 and d23 = 6.32. Please
refer to the figure below.
Child 1 is now deemed to be closer to child 2 than to child
3 [the opposite of the earlier situation, when
height was expressed in feet!].

[Figure: the three children plotted with height in inches;
d12 = 7.81, d23 = 6.32, d13 = 12.37]
Dissimilarity and Distance Measures

This is the problem. What to do?
Standardization: z_ik = x_ik / s_k, where s_k is the standard
deviation of the kth variable.
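A short sketch, assuming the children's data above, that reproduces the scale sensitivity of the raw Euclidean distances and shows how dividing by the standard deviation removes the dependence on the measurement unit.

import numpy as np

# Heights in feet vs inches for the three children in the example above.
feet   = np.array([[60, 3.0], [65, 3.5], [63, 4.0]])    # (weight lbs, height ft)
inches = np.array([[60, 36.0], [65, 42.0], [63, 48.0]]) # (weight lbs, height in)

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

for data, unit in ((feet, "feet"), (inches, "inches")):
    d12 = euclidean(data[0], data[1])
    d13 = euclidean(data[0], data[2])
    d23 = euclidean(data[1], data[2])
    print(unit, round(d12, 2), round(d13, 2), round(d23, 2))
# feet:    5.02  3.16  2.06  -> child 1 closer to child 3
# inches:  7.81 12.37  6.32  -> child 1 closer to child 2

# Standardization z_ik = x_ik / s_k removes the dependence on the unit chosen:
z = inches / inches.std(axis=0, ddof=1)
print(np.round([euclidean(z[0], z[1]), euclidean(z[0], z[2]), euclidean(z[1], z[2])], 2))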
Distance Measures
Other options include:

1. Sum of squared differences.
2. Replace squared differences by the sum of the
absolute differences of the coordinates.

The second approach is known as the city-block (Manhattan) approach, but it
has several problems.
Distance Measures

Mahalanobis Distance

The Mahalanobis approach not only performs a
standardization of the data by scaling in terms
of standard deviations, but also sums the pooled
within-group variance-covariance, which adjusts for
intercorrelations among the variables.

In short, the Mahalanobis generalised distance procedure
computes a distance measure between objects that is
comparable to R^2 (the coefficient of determination) in
regression.
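As an illustration only, the following sketch computes a Mahalanobis distance from a sample variance-covariance matrix, used here as a stand-in for the pooled within-group matrix mentioned above; the data are hypothetical.

import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between two observations given a covariance matrix."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Illustrative data: two correlated variables measured on a small sample.
X = np.array([[60, 36.0], [65, 42.0], [63, 48.0], [61, 40.0], [64, 45.0]])
cov = np.cov(X, rowvar=False)   # sample variance-covariance matrix of the two variables

print(round(mahalanobis(X[0], X[1], cov), 3))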
Between Group Similarity
and Distance Measures
In clustering applications, it is also necessary to be
able to define such measures between groups. Two
of the three problems which arise in finding a
suitable definition are:

1. The choice of a summary for each variable to
describe the group. Sensible choices might be
proportions for qualitative variables and means for
quantitative variables.
Between Group Similarity
and Distance Measures

2. Measurement of within-group variation.

3. Construction of a measure of similarity or distance
based on (1), making allowance for (2).

Making allowance for within-group variation can be
particularly tricky if it is not constant from one
group to another, and there is no reason to believe
that it should be.
Distance Measures

Quantitative Variables
A distance measure that geneticists have used
when describing groups, or populations, in terms of
gene frequencies is the so-called genetic distance d_AB,
defined as

d_AB = sqrt(1 - cos θ), where cos θ = Σ_i sqrt(p_iA p_iB).

The terms p_iA and p_iB are the gene frequencies
for the ith allele at a given locus in the two
populations.
The angular transformation of the proportions has a
variance-stabilizing role. When several genetic loci are
considered, the d_AB values are added together.

This approach to measuring distances between groups
can be generalized to qualitative variables merely by
replacing the word locus by variable and allele by
variable category.

If there are p qualitative variables, there will be p d_AB
values to add together. As an example, consider the
set of proportions for two hypothetical populations of
red campion.
Example:
Set of proportions for two populations of red campion

Character state           Proportions     Proportions
                          Population A    Population B
Corolla colour
  pink                    0.95            0.80
  white                   0.05            0.20
Coronal scale colour
  as petals, pink         0.85            0.75
  not as petals, pink     0.01            0.15
  not as petals, white    0.14            0.10
Red calyx pigment
  present                 0.80            0.60
  absent                  0.20            0.40
For corolla colour:
d_AB = {1 - (0.95 x 0.80)^1/2 - (0.05 x 0.20)^1/2}^1/2 = 0.17

For coronal scale colour:
d_AB = {1 - (0.85 x 0.75)^1/2 - (0.01 x 0.15)^1/2 - (0.14 x 0.10)^1/2}^1/2 = 0.21

For red calyx pigment:
d_AB = {1 - (0.80 x 0.60)^1/2 - (0.20 x 0.40)^1/2}^1/2 = 0.16

The total distance between the two groups is taken as
the sum:
d_AB = 0.17 + 0.21 + 0.16 = 0.54
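The red campion figures can be checked with a few lines of Python; the function name is illustrative, and the per-character distances are rounded as on the slide.

import numpy as np

def genetic_distance(pA, pB):
    """d_AB = sqrt(1 - sum_i sqrt(p_iA * p_iB)) for one locus / variable."""
    return float(np.sqrt(1.0 - np.sum(np.sqrt(np.asarray(pA) * np.asarray(pB)))))

corolla = genetic_distance([0.95, 0.05], [0.80, 0.20])
coronal = genetic_distance([0.85, 0.01, 0.14], [0.75, 0.15, 0.10])
calyx   = genetic_distance([0.80, 0.20], [0.60, 0.40])

print(round(corolla, 2), round(coronal, 2), round(calyx, 2))   # 0.17 0.21 0.16
print(round(corolla + coronal + calyx, 2))   # about 0.53-0.54; adding the rounded values gives 0.54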
How to Select a Distance Measure?
Use several measures and compare the results to a
theoretical or known pattern. Also, when the
variables have different units, one should
standardize before running the cluster analysis.
Standardization is particularly advisable when the
range of one variable is much larger than that of the others.

Finally, when the variables are inter-correlated,
the Mahalanobis distance measure is the most
appropriate.
Partitioning or Non
Hierarchical Methods
Cluster Analysis: Objective

Group data points into clusters so that the degree of
association is strong between members of the same
cluster (low variance within) and weak between
members of different clusters (high variance
between).
Cluster Analysis: Non Hierarchical
Methods
Clustering refers to identifying the number of
subclasses in a data universe X comprised of n data
samples, and partitioning X into c clusters (2 <= c < n).
c = 1 corresponds to rejection of the hypothesis that there are clusters
in the data, whereas c = n constitutes the trivial case
where each sample is in a cluster by itself.
There are two kinds of c-partitions: 1. Hard (or crisp)
2. Soft (or fuzzy).
Issues of Importance in Cluster Analysis

Issue 1
How to measure (mathematical) similarity
between pairs of observations?
A simple method could be the
distance between pairs of feature vectors in the
feature space. We then expect that the distance
between points in the same cluster will be
considerably less than the distance between points
in different clusters.
Issues of Importance in Cluster Analysis

Issue 2
How to evaluate the partitions (clusters) once they
are formed? This could be termed cluster
validity.
In this case, it is necessary to identify the value of
c that gives the most plausible number of
clusters in the data for the analysis at hand.
Nonhierarchical Methods

A single-pass method is one in which the partition is
created by a single pass through the data set or, if
randomly accessed, in which each compound is
examined only once to decide which cluster it should
be assigned to.

A relocation method is one in which compounds are
moved from one cluster to another to try to
improve on the initial estimation of the clusters.
Non-hierarchical Methods
The relocation is typically accomplished by
improving a cost function describing the goodness of
each resultant cluster.

The nearest-neighbour approach is more compound
centred than the other non-hierarchical methods.
In it, the environment around each compound is
examined in terms of its most similar neighbouring
compounds, with commonality between nearest
neighbours being used as a criterion for cluster
formation.
Clustering Method Proposed by
Professor James Bezdek
We define optimum partitions through a global criterion
function that measures the extent to which candidate
partitions optimize a weighted sum of squared errors
between data points and cluster centers in feature space.

Many other clustering algorithms are available.
However, the method of clustering must be closely
matched with the particular data under study for
successful interpretation of the substructure in the data.
C - Means Clustering

Bezdek (1981) developed an extremely powerful
classification method to accommodate fuzzy data. It
is an extension of the c-means, or hard c-means,
method employed in crisp classification.

For n data samples:

X = {x1, x2, x3, ..., xn}                                  (1)

Each data sample xi is defined by m features, i.e.,

xi = {xi1, xi2, xi3, ..., xim}                             (2)

where each xi in X is an m-dimensional vector of m
elements or m features. Normalize the m feature elements
before classification (if need be), as these could have
different units. In a geometric sense, each xi is a
point in m-dimensional feature space, and the universe
of the data samples, X, is a point set with n
elements in the sample space.

Bezdek [1981] suggested using an objective
function approach for clustering the data into
hyper-spherical clusters. The idea for hard
clustering is as follows:
each cluster of data has a hyper-spherical shape with
a hypothetical geometric cluster center.
[Figure: two hyper-spherical clusters in three-dimensional (x, y, z) feature
space with cluster centers v1 and v2]

The objective function is developed for two
purposes simultaneously: first, minimize the
Euclidean Distance (ED) between each data point
and its cluster center, and second, maximize
the ED between cluster centers.
Hard c - Means (HCM)
HCM is used to classify data in a crisp sense
(each data point will be assigned to one, and only one,
data cluster). In this sense, clusters are also called
partitions - that is, partitions of the data.

Define a family of sets {Ai, i = 1, 2, ..., c} as a hard c-
partition of X, where the following set-theoretic
forms apply to these partitions:

U_{i=1..c} Ai = X                                          (3)

Ai ∩ Aj = ∅ for all i ≠ j                                  (4)

∅ ⊂ Ai ⊂ X for all i                                       (5)

where X = {x1, x2, x3, ..., xn} is a finite set space of the
universe of data samples, and c is the number of
classes, or partitions, or clusters, into which we want
to classify the data.
Suppose we have the case where c = 2; the following
are the set expressions:

A2 = A1', A1 ∪ A1' = X and A1 ∩ A1' = ∅

These set expressions are equivalent to the
excluded middle laws.

The function-theoretic expressions associated
with Eqs. (3), (4) and (5) are:

Σ_{i=1..c} χ_Ai(xk) = 1 for all k                          (7)

χ_Ai(xk) ∧ χ_Aj(xk) = 0 for all k, i ≠ j                   (8)

0 < Σ_{k=1..n} χ_Ai(xk) < n for all i                      (9)

where the characteristic function χ_Ai(xk) is defined once again
as

χ_Ai(xk) = 1 if xk ∈ Ai, and 0 if xk ∉ Ai                  (10)

Explanation:
Any sample xk can only and definitely belong to one
of the c classes (Eqs. (7) and (8)), while Eq. (9) implies
that no class is empty and no class is the whole set
X (that is, the universe).

The membership of the jth data point in the ith cluster, or class, is defined
to be χ_ij = χ_Ai(xj). We now define a matrix U comprising
the elements χ_ij (i = 1, 2, ..., c; j = 1, 2, ..., n), that is, c rows and
n columns. We define a hard c-partition space for X as
the following matrix set:

Mc = { U | χ_ij ∈ {0, 1}, Σ_{i=1..c} χ_ik = 1, 0 < Σ_{k=1..n} χ_ik < n }    (11)

The cardinality of any hard c-partition space Mc is

η(Mc) = (1/c!) Σ_{i=1..c} (c choose i) (-1)^(c-i) i^n                       (12)
Example
X = {x1, x2, x3, x4, x5}, that is n = 5, and suppose we
want to cluster these points into c = 2 clusters.

η(Mc) = (1/2!){ 2(-1)(1)^5 + 1(+1)(2)^5 } = (1/2)(-2 + 32) = 15

Some of the 15 possible hard 2-partitions are:

1 1 1 1 0    1 1 1 0 0    1 1 1 0 0    0 0 0 0 1
0 0 0 0 1    0 0 0 1 1    0 0 0 1 1    1 1 1 1 0

and so on.
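A quick numerical check of the cardinality formula in Eq. (12), which is the Stirling number of the second kind; the function name is illustrative.

from math import comb, factorial

def hard_partition_count(n, c):
    """Number of distinct hard c-partitions of n samples (Eq. 12)."""
    return sum((-1) ** (c - i) * comb(c, i) * i ** n for i in range(1, c + 1)) // factorial(c)

print(hard_partition_count(5, 2))    # 15, as in the example above
print(hard_partition_count(4, 2))    # 7, used later for the catalytic-converter data
print(hard_partition_count(25, 10))  # on the order of 10**18, as noted further below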
Notice that the following two matrices,

U = 1 1 1 1 0      and      U = 0 0 0 0 1     (rows c1, c2)
    0 0 0 0 1                   1 1 1 1 0

are not different clustering 2-partitions.
In fact, they are the same 2-partition irrespective
of an arbitrary row swap (relabeling the rows of the U
matrix as c2 and c1). The cardinality measure gives
the number of unique c-partitions for n data points.
An interesting question now arises: of all the
possible c-partitions for n data samples, how can we
select the most reasonable c-partition from the
partition space Mc?
Which of the 15 possible hard 2-partitions for 5 data
points and 2 classes is the best? The answer: use an
objective function (or classification criterion)
to classify or cluster the data.
The hard c-means algorithm is a within-class
sum of squared errors approach using a Euclidean norm
to characterize distance, denoted J(U, v), where U is
the partition matrix and the parameter v is a vector
of cluster centers.
Objective Function
J(U, v) = Σ_{k=1..n} Σ_{i=1..c} χ_ik (d_ik)^2                                (13)

where d_ik is the ED measure (in the m-dimensional feature
space R^m) between the kth data sample xk and the ith
cluster center vi, given by

d_ik = d(xk, vi) = ||xk - vi|| = [ Σ_{j=1..m} (x_kj - v_ij)^2 ]^(1/2)        (14)
Since each data sample requires m coordinates to
describe its location in R^m-space, each cluster center
also requires m coordinates to describe its location in
this same space. Therefore, the ith cluster center is a vector
of length m,

vi = {v_i1, v_i2, ..., v_im}                                                 (15)

where the jth coordinate is calculated by

v_ij = ( Σ_{k=1..n} χ_ik x_kj ) / ( Σ_{k=1..n} χ_ik )                        (16)

We seek the optimum partition, U*, to be the
partition that produces the minimum value of the
function J. That is,

J(U*, v*) = min_{U ∈ Mc} J(U, v)                                             (17)

Finding U* by exhaustive search is exceedingly difficult, since the
cardinality of Mc grows very rapidly: for n = 25 and c = 10, the cardinality
of Mc is on the order of 10^18! Fortunately, a useful and
effective alternative search algorithm has been
devised [Bezdek, 1981].
Iterative Optimization [Bezdek, 1981]
Iterative optimization is basically like many other
iterative methods in that we start with an initial guess
at the U matrix.

The stepwise procedure is as follows:

1. Fix c (2 <= c < n) and initialize the U matrix:
   U(0) ∈ Mc. Then, for r = 1, 2, ...

2. Calculate the c center vectors:
   { vi(r) } with U(r).

3. Update U(r); calculate the updated
   characteristic function (for all i, k):

   χ_ik(r+1) = 1 if d_ik(r) = min{ d_jk(r) for all j in c },
               0 otherwise                                                   (18)

4. If || U(r+1) - U(r) || <= ε (tolerance level),                            (19)
   STOP; otherwise set r = r + 1 and return to step 2.

In step 4 the norm || · || is any matrix norm, such as the
Euclidean norm.
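The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions (for instance, that no cluster ever becomes empty), not a production implementation.

import numpy as np

def hard_c_means(X, c, U0, max_iter=100):
    """Iterative optimization for hard c-means (steps 1-4 above).
    X: (n, m) data matrix; c: number of clusters; U0: (c, n) initial crisp partition.
    Assumes no cluster ever becomes empty during the iterations."""
    X = np.asarray(X, dtype=float)
    U = np.asarray(U0, dtype=float)
    assert U.shape == (c, X.shape[0])
    for _ in range(max_iter):
        # Step 2: cluster centres v_i = sum_k chi_ik x_k / sum_k chi_ik   (Eq. 16)
        V = (U @ X) / U.sum(axis=1, keepdims=True)
        # Step 3: distances d_ik (Eq. 14) and the updated characteristic function (Eq. 18)
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)   # shape (c, n)
        U_new = np.zeros_like(U)
        U_new[np.argmin(d, axis=0), np.arange(X.shape[0])] = 1.0
        # Step 4: stop when the partition no longer changes (tolerance 0)
        if np.array_equal(U_new, U):
            break
        U = U_new
    return U, V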
Example: Performance of
Catalytic Converter
In a chemical engineering process involving an
automobile's catalytic converter (which converts
CO to CO2), we have a relationship between the
conversion efficiency of the catalytic converter
and the inverse of the temperature of the
catalyst. Two classes of data are known from
reaction efficiency: points of high conversion
efficiency and high temperature are indicators of
a non-polluting system (class c1), and points of low
conversion efficiency and low temperature are
indicative of a polluting system (class c2).
Suppose we measure the conversion efficiency
and temperature (T) of four different
catalytic converters and attempt to characterize
them as polluting or non-polluting.
The four data points (n = 4) are shown in the figure,
where the y-axis is conversion efficiency and the x-
axis is the inverse of the temperature [in a
conversion process like this the exact solution
takes the form ln(1/T)]. The data are described by
two features (m = 2) and have the following
coordinates in 2D space:
x1 = {1, 3}, x2 = {1.5, 3.2},
x3 = {1.3, 2.8}, x4 = {3, 1}
[Figure: the four data points in two-dimensional feature space, with the
initial partition U(0)]
To classify these data points into two classes (c = 2), it
is desirable first to compute the cardinality of the possible
number of crisp partitions for the system, i.e., to find
η(Mc):

η(Mc) = (1/2!){ (2 choose 1)(-1)^1 (1)^4 + (2 choose 2)(-1)^0 (2)^4 }
      = (1/2)[-2 + 16] = 7,

which says that there are seven
ways (irrespective of row swaps) to classify the
four points into two clusters. Let the initial guess of
the crisp partition U assign x1 to class 1
and x2, x3, x4 to class 2:

U(0) = 1 0 0 0
       0 1 1 1
We seek the optimal partition U* from among the seven
partitions. With a desired tolerance or convergence
level ε: U(0) → U(1) → U(2) → ... → U*.
For class 1 we calculate the coordinates of the
cluster center:

v_1j = (χ_11 x_1j + χ_12 x_2j + χ_13 x_3j + χ_14 x_4j) / (χ_11 + χ_12 + χ_13 + χ_14)
     = [ (1) x_1j + (0) x_2j + (0) x_3j + (0) x_4j ] / (1 + 0 + 0 + 0)

and vi = {v_i1, v_i2, ..., v_im}.
For m = 2, we deal with two coordinates
for each data point: vi = {v_i1, v_i2},
where for c = 1 (class 1), v1 = {v_11, v_12}; similarly, for c = 2
(class 2), v2 = {v_21, v_22}.
Therefore, using the expression for v_ij with i = 1 and j = 1, 2,
respectively:

v_11 = 1(1)/1 = 1    (x coordinate)
v_12 = 1(3)/1 = 3    (y coordinate), so v1 = {1, 3},

which just happens to be the coordinates of point x1,
since this is the only point in the class for the
assumed initial partition U(0).
For c = 2, we get the cluster center coordinates

v_2j = [ (0) x_1j + (1) x_2j + (1) x_3j + (1) x_4j ] / (0 + 1 + 1 + 1)
     = (x_2j + x_3j + x_4j) / 3

For i = 2 and j = 1, 2, respectively:

v_21 = [1(1.5) + 1(1.3) + 1(3)] / 3 = 1.93    (x coordinate)
v_22 = [1(3.2) + 1(2.8) + 1(1)] / 3 = 2.33    (y coordinate)
so v2 = {1.93, 2.33}.
We now compute the values of d_ik, the distance
from sample xk to the center vi of
the ith class, using Eq. (14):

d_ik = [ Σ_{j=1..m} (x_kj - v_ij)^2 ]^(1/2)

For c = 1: d_1k = [ (x_k1 - v_11)^2 + (x_k2 - v_12)^2 ]^(1/2).
Computing for each data sample k = 1 to 4:

d_11 = sqrt[ (1-1)^2 + (3-3)^2 ] = 0
d_12 = sqrt[ (1.5-1)^2 + (3.2-3)^2 ] = 0.54
d_13 = sqrt[ (1.3-1)^2 + (2.8-3)^2 ] = 0.36
d_14 = sqrt[ (3-1)^2 + (1-3)^2 ] = 2.83

and for cluster 2:

d_21 = sqrt[ (1-1.93)^2 + (3-2.33)^2 ] = 1.14
d_22 = sqrt[ (1.5-1.93)^2 + (3.2-2.33)^2 ] = 0.97
d_23 = sqrt[ (1.3-1.93)^2 + (2.8-2.33)^2 ] = 0.78
d_24 = sqrt[ (3-1.93)^2 + (1-2.33)^2 ] = 1.70
Update the partition to U(1) for each data point
using Eq. (18). Hence, for class 1 we compare
d_1k against min{d_1k, d_2k}.
For k = 1, 2, 3 and 4, respectively:
d_11 = 0.00, min(d_11, d_21) = min(0, 1.14) = 0.00, thus χ_11 = 1
d_12 = 0.54, min(d_12, d_22) = min(0.54, 0.97) = 0.54, thus χ_12 = 1
d_13 = 0.36, min(d_13, d_23) = min(0.36, 0.78) = 0.36, thus χ_13 = 1
d_14 = 2.83, min(d_14, d_24) = min(2.83, 1.70) = 1.70, thus χ_14 = 0
Therefore, the updated partition is

U(1) = 1 1 1 0
       0 0 0 1
Since the partitions U(0) and U(1) are different,
we repeat the same procedure based on the new assignment
of the two classes. For c = 1, the center coordinates are

v_1j = (x_1j + x_2j + x_3j) / (1 + 1 + 1 + 0), since χ_14 = 0
v_11 = (x_11 + x_21 + x_31)/3 = (1 + 1.5 + 1.3)/3 = 1.26
v_12 = (x_12 + x_22 + x_32)/3 = (3 + 3.2 + 2.8)/3 = 3.0
v1 = {1.26, 3.0}

and for c = 2, the center coordinates are

v_2j = x_4j / (0 + 0 + 0 + 1), since χ_21 = χ_22 = χ_23 = 0
v_21 = 3/1 = 3
v_22 = 1/1 = 1, so v2 = {3, 1}

Now, we calculate the distance measures again:

d_11 = sqrt[ (1-1.26)^2 + (3-3)^2 ] = 0.26
d_12 = sqrt[ (1.5-1.26)^2 + (3.2-3)^2 ] = 0.31
d_13 = sqrt[ (1.3-1.26)^2 + (2.8-3)^2 ] = 0.20
d_14 = sqrt[ (3-1.26)^2 + (1-3)^2 ] = 2.65
d_21 = sqrt[ (1-3)^2 + (3-1)^2 ] = 2.83
d_22 = sqrt[ (1.5-3)^2 + (3.2-1)^2 ] = 2.66
d_23 = sqrt[ (1.3-3)^2 + (2.8-1)^2 ] = 2.47
d_24 = sqrt[ (3-3)^2 + (1-1)^2 ] = 0.0
and again update the partition from U(1) to U(2):

d_11 = 0.26, min(d_11, d_21) = min(0.26, 2.83) = 0.26, thus χ_11 = 1
d_12 = 0.31, min(d_12, d_22) = min(0.31, 2.66) = 0.31, thus χ_12 = 1
d_13 = 0.20, min(d_13, d_23) = min(0.20, 2.47) = 0.20, thus χ_13 = 1
d_14 = 2.65, min(d_14, d_24) = min(2.65, 0) = 0.0, thus χ_14 = 0

Because the partitions U(1) and U(2) are identical, we
can say the iterative process has converged;
therefore, the optimum hard (crisp) partition is

U* = 1 1 1 0
     0 0 0 1
The optimum partition tells us that, for the
catalytic converter example, the data points x1, x2
and x3 are more indicative of a non-polluting
converter than is data point x4.
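Reusing the hard_c_means sketch given earlier (assuming that function is in scope), the catalytic-converter example can be reproduced as follows.

import numpy as np

# Four catalytic-converter samples in (1/T, conversion efficiency) feature space.
X = np.array([[1.0, 3.0], [1.5, 3.2], [1.3, 2.8], [3.0, 1.0]])
U0 = np.array([[1, 0, 0, 0],
               [0, 1, 1, 1]], dtype=float)     # the initial guess U(0)

U_star, V = hard_c_means(X, c=2, U0=U0)
print(U_star)           # expected: [[1 1 1 0], [0 0 0 1]], the optimum crisp partition U*
print(np.round(V, 2))   # cluster centres, approximately (1.27, 3.0) and (3.0, 1.0)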
[Figure: the final two clusters, U(final) - c1 containing x1, x2, x3 and c2
containing x4 - shown as the four data points in two-dimensional feature space]
HCM is also known as k-means clustering.
What Next?
A Primer on Classical Logic and Fuzzy Logic

Professor Ashok Deshpande
Fuzzy c-Means (FCM)

Why Fuzzy c-Means?
Three examples

Problem 1: Iterative Optimization
To which class does this point belong?
The Butterfly Classification Problem (Bezdek, 1981)
Butterfly Classification: Example 1
A good example of the iterative optimization method
is provided by the butterfly problem. We have 15
data points, and one of them lies on the vertical line of
symmetry (the point in the middle of the
data cluster). If c = 2 classes, and the points to the left and
right of the line of symmetry belong to different
classes, the problem lies in assigning the point on the
line of symmetry to a class.
To which class does this point belong? Whichever class
the algorithm assigns this point to, there will be a
good argument that it should be a member of the
other class.
Butterfly Classification Problem

Alternatively, the argument may revolve around the
fact that the choice of two classes is a poor one for
this problem. Three classes might be the best
choice, but the physics underlying the data might be
binary, and two classes may be the only option.
In conducting the iterative optimization approach
we have to assume an initial U matrix. This matrix
will have two rows (two classes, c = 2) and 15
columns (15 data points, n = 15). It is important to
understand that the classes may be unlabeled in this
process.
Butterfly Classification Problem

That is, we can look at the structure of the data
without the need to assign labels to
the classes. This is often the case when one is first
looking at a group of data. After several iterations
with the data, and as we become more and more
knowledgeable about the data, we can then assign
labels to the classes. We start the solution with the
assumption that the point in the middle is assigned
to the class represented by the bottom row of the
initial U matrix, U(0):

U(0) = 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
       0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
Butterfly Classification Problem

After four iterations (Bezdek, 1981) this method
converges to within a tolerance level of ε = 0.01 as

U(4) = 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
       0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

The point on the line of symmetry has full
membership in the second class and no membership in
the first class.
We can see from the shape that the eighth point
should share membership with each class. This is not
possible with crisp classification; membership is binary -
a point is either a member of a class or it is not.
Example 2
Looking at the picture, along the x-axis we may identify two
clusters in the proximity of the two data concentrations.
We will refer to them as A and B. In the first
approach, the k-means algorithm, we associated each
datum with a specific centroid; the membership of each
datum in a cluster was therefore either 0 or 1.
Fuzzy c-Means
In the FCM approach, instead, a given datum
does not belong exclusively to a single well-defined cluster;
it can be placed in a middle way. In this case, the
membership function follows a smoother curve to
indicate that every datum may belong to several
clusters with different values of the membership
coefficient.
Fuzzy c-Means

In the figure referred to above, the datum shown as a marked
spot belongs more to the B cluster than to
the A cluster: a membership value of 0.2 indicates its
degree of membership in A. Now,
instead of using a graphical representation, we
introduce a matrix U whose entries are the values
taken from the membership functions:

crisp U_{N x c}        fuzzy U_{N x c}
  c1  c2                 c1   c2
   1   0                0.8  0.2
   0   1                0.3  0.7
   1   0                0.6  0.4
   0   1                0.9  0.1
Example-3
Suppose you are a fruit geneticist interested in
genetic relationships among fruits. In particular, you
know that a tangelo is a cross between a grapefruit
and a tangerine. You describe the fruit with such
features as colour, weight, sphericity, sugar content,
skin texture and so on. Hence, your feature space
could be highly dimensional.

Suppose you have three fruits (three data points),
X = {x1 = grapefruit, x2 = tangelo, x3 = tangerine},
each described by m features.
Continued

Let us classify the three fruits into two classes (using
crisp classification) to decide the genetic
assignment of the three fruits. The cardinality for this
case, where n = 3 and c = 2, is η(Mc) = 3. Arrange the U matrix as
follows (x1 = grapefruit, x2 = tangelo, x3 = tangerine):

        x1  x2  x3
U = c1   1   0   0
    c2   0   1   1

The three possible 2-partition matrices are:

c1   1 0 0     1 1 0     1 0 1
c2   0 1 1     0 0 1     0 1 0
The first crisp partition is an uncomfortable segregation:
the grapefruit is in one class and the tangelo and tangerine
in the other, indicating that the two share nothing in
common with the grapefruit!
The second partition puts the grapefruit and the tangelo in one
class, suggesting that they share nothing in common with the
tangerine!
Finally, the third partition is the most genetically
discomforting of all, because here the tangelo is in a
class by itself, sharing nothing in common with its
progenitors! One of the three partitions would have to be the final
answer. Which one is the best? The answer is NONE.
What to do? Is this a fuzzy case? YES.
In the fuzzy case, this segregation and genetic absurdity is
not a problem. We can have the most intuitive situation,
where the tangelo shares membership with both classes,
i.e. with both parents. The following partition might be a
typical outcome for the fruit genetics problem:
     x1 (grapefruit)  x2 (tangelo)  x3 (tangerine)
U = c1    0.91            0.58          0.13
    c2    0.09            0.42          0.87

We can show that the sum of each row is a number between 0 and n:
0 < Σ_k μ_1k = 1.62 < 3 and 0 < Σ_k μ_2k = 1.38 < 3.
Moreover, μ_11 ∧ μ_21 = min(0.91, 0.09) = 0.09 ≠ 0,
μ_12 ∧ μ_22 = min(0.58, 0.42) = 0.42 ≠ 0, and μ_13 ∧ μ_23 = 0.13 ≠ 0.
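A small numeric check of these soft-partition properties (illustrative only).

import numpy as np

# Columns sum to 1; each row sum lies strictly between 0 and n; pairwise minima are non-zero.
U = np.array([[0.91, 0.58, 0.13],
              [0.09, 0.42, 0.87]])

print(U.sum(axis=0))            # [1. 1. 1.]  -> each fruit's memberships sum to 1
print(U.sum(axis=1))            # [1.62 1.38] -> 0 < row sum < n = 3
print(np.minimum(U[0], U[1]))   # [0.09 0.42 0.13] -> all non-zero, unlike a crisp partition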
How??
Fuzzy c-Means
Algorithm
Fuzzy c-Means

The Fuzzy c-Means (FCM) algorithm generalizes
the hard c-means algorithm to allow a point to
belong partially to multiple clusters. Therefore, it
produces a soft partition for a given data set. The
objective function J1 of hard c-means is extended in
two ways.

First, fuzzy membership degrees in clusters
are incorporated into the formulation,
Fuzzy c-Means

and second, an additional parameter m is
introduced as a weight exponent in the fuzzy
membership.

The extended objective function, denoted Jm, is

Jm(P, V) = Σ_{i=1..k} Σ_{xk ∈ X} ( μ_Ci(xk) )^m || xk - vi ||^2              [1]

where P is a fuzzy partition of the dataset X formed
by C1, C2, ..., Ck. The parameter m is a weight that
determines the degree to which partial members of a
cluster affect the clustering result.
Fuzzy c-Means

The algorithm tries to find a good partition by
searching for prototypes vi that minimize the
objective function Jm. The fuzzy c-means algorithm also needs to
search for membership functions μ_Ci that minimize
Jm.
To accomplish these two objectives, a necessary
condition for a local minimum of Jm was derived from
Jm. This condition, formally stated below, serves
as the foundation of the fuzzy c-means algorithm.
Fuzzy c -Means Theorem (A
Summary)
A constrained fuzzy partition {C1, C2, ..., Ck} can be
a local minimum of the objective function Jm only
if the following conditions are satisfied:

μ_Ci(x) = 1 / Σ_{j=1..k} ( || x - vi ||^2 / || x - vj ||^2 )^(1/(m-1)),
          1 <= i <= k, x ∈ X                                                 [2]

vi = ( Σ_{x ∈ X} [ μ_Ci(x) ]^m x ) / ( Σ_{x ∈ X} [ μ_Ci(x) ]^m ),
          1 <= i <= k                                                        [3]
Based on this theorem, FCM updates the
prototypes and the membership functions iteratively, using equations [2]
and [3], until a convergence criterion is reached.
The Fuzzy c-means (FCM) Algorithm

FCM(X, c, m, ε)
X: an unlabeled data set; c: the number of clusters to
form; m: the parameter (weight exponent) in the objective
function; ε: the threshold for the convergence
criterion.

Initialize the prototypes V = {v1, v2, ..., vc}
Repeat:
    V_previous ← V
    Compute the membership functions using equation [2]
    Update the prototypes vi in V using equation [3]
until Σ_{i=1..c} || vi - vi_previous || <= ε
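A minimal Python sketch of the algorithm using equations [2] and [3]. The function name, the random initialization fallback and the L1 convergence test are illustrative choices, and the demo run uses the six-point data set of the worked problem that follows.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=0.01, V0=None, max_iter=100):
    """Fuzzy c-means following equations [2] and [3] above.
    X: (n, d) data; c: number of clusters; m: weight exponent (> 1);
    eps: threshold on the total prototype movement."""
    X = np.asarray(X, dtype=float)
    if V0 is not None:
        V = np.asarray(V0, dtype=float)
    else:
        V = X[np.random.default_rng(0).choice(len(X), size=c, replace=False)]
    for _ in range(max_iter):
        # Equation [2]: membership of every point in every cluster
        d2 = np.maximum(((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2), 1e-12)  # (n, c)
        ratio = (d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0))              # (n, c, c)
        U = 1.0 / ratio.sum(axis=2)                                                 # (n, c)
        # Equation [3]: prototype update, weighted by membership ** m
        W = U.T ** m                                                                # (c, n)
        V_new = (W @ X) / W.sum(axis=1, keepdims=True)
        if np.abs(V_new - V).sum() <= eps:
            return U, V_new
        V = V_new
    return U, V

# Demo on the six-point data set of the worked problem below,
# with the same initial prototypes v1 = (5, 5) and v2 = (10, 10).
X = np.array([[2, 12], [4, 9], [7, 13], [11, 5], [12, 7], [14, 4]], dtype=float)
U, V = fuzzy_c_means(X, c=2, m=2, V0=[[5, 5], [10, 10]])
print(np.round(U, 4))   # memberships in the two clusters, one row per data point
print(np.round(V, 4))   # converged prototypes (iterated further than the single hand-worked step)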
Problem on Fuzzy c-Means

Suppose we are given a data set of six points, each of
which has two features, f1 and f2 (see the table below).
Assume that we want to use FCM to partition the data
into two clusters (c = 2), that we set the parameter m
to 2, and that the initial prototypes are
v1 = (5, 5) and v2 = (10, 10).

      f1   f2
x1     2   12
x2     4    9
x3     7   13
x4    11    5
x5    12    7
x6    14    4
[Figure: An Example of the Fuzzy c-Means Algorithm - the six data points
x1..x6 plotted in the (f1, f2) feature plane with the initial prototypes
v1 = (5, 5) and v2 = (10, 10)]
Problem on Fuzzy c-Means

The initial membership functions of the two clusters
are calculated using equation [2] (with m = 2):

μ_Ci(xk) = 1 / Σ_{j=1..2} ( || xk - vi ||^2 / || xk - vj ||^2 )

|| x1 - v1 ||^2 = |2-5|^2 + |12-5|^2 = 3^2 + 7^2 = 58

Similarly,
|| x1 - v2 ||^2 = 8^2 + 2^2 = 68
μ_C1(x1) = 1 / { 58/58 + 58/68 } = 0.5397
μ_C2(x1) = 1 / { 68/58 + 68/68 } = 0.4603
Problem on Fuzzy c-Means

Similarly, we obtain the following:

μ_C1(x2) = 1 / { 17/17 + 17/37 } = 0.6852
μ_C2(x2) = 1 / { 37/17 + 37/37 } = 0.3148

μ_C1(x3) = 1 / { 68/68 + 68/18 } = 0.2093
μ_C2(x3) = 1 / { 18/68 + 18/18 } = 0.7907

μ_C1(x4) = 1 / { 36/36 + 36/26 } = 0.4194
μ_C2(x4) = 1 / { 26/36 + 26/26 } = 0.5806

Problem on Fuzzy c-Means

μ_C1(x5) = 1 / { 53/53 + 53/13 } = 0.1970
μ_C2(x5) = 1 / { 13/53 + 13/13 } = 0.8030

μ_C1(x6) = 1 / { 82/82 + 82/52 } = 0.3881
μ_C2(x6) = 1 / { 52/82 + 52/52 } = 0.6119

Therefore, using these initial prototypes of the
clusters, the membership functions indicate that x1 and x2 are more in
the first cluster, while the remaining points in the
dataset are more in the second cluster.

The FCM algorithm then updates the prototypes
according to equation [3]:
Problem on Fuzzy c-Means
v1 = ( Σ_{k=1..6} [ μ_C1(xk) ]^2 xk ) / ( Σ_{k=1..6} [ μ_C1(xk) ]^2 )

   = [ (0.5397)^2 (2,12) + (0.6852)^2 (4,9) + (0.2093)^2 (7,13)
     + (0.4194)^2 (11,5) + (0.1970)^2 (12,7) + (0.3881)^2 (14,4) ]
     / [ (0.5397)^2 + (0.6852)^2 + (0.2093)^2 + (0.4194)^2 + (0.1970)^2 + (0.3881)^2 ]

   = { 7.2761/1.0979, 10.044/1.0979 }
   = (6.6273, 9.1484)
Problem on Fuzzy c-Means
v2 = ( Σ_{k=1..6} [ μ_C2(xk) ]^2 xk ) / ( Σ_{k=1..6} [ μ_C2(xk) ]^2 )

   = [ (0.4603)^2 (2,12) + (0.3148)^2 (4,9) + (0.7907)^2 (7,13)
     + (0.5806)^2 (11,5) + (0.8030)^2 (12,7) + (0.6119)^2 (14,4) ]
     / [ (0.4603)^2 + (0.3148)^2 + (0.7907)^2 + (0.5806)^2 + (0.8030)^2 + (0.6119)^2 ]

   = { 22.326/2.2928, 19.4629/2.2928 }
   = (9.7374, 8.4887)
[Figure: Example of the Fuzzy c-Means Algorithm - the six data points with the
updated prototypes v1 = (6.6273, 9.1484) and v2 = (9.7374, 8.4887) in the
(f1, f2) feature plane]
Fuzzy c-Means: Example

[Figure: the four fuzzy data points to be clustered using FCM, plotted in the plane]

Consider the following data points: x1 = (1, 5); x2 = (2, 4);
x3 = (1, 1.5) and x4 = (1.5, 0.5).
Use a weighting factor m = 2 and a convergence tolerance ε = 0.05.
To start, take the initial partition

U(0) = 1 0 1 0
       0 1 0 1
The cluster center coordinates for each class are
given by

v_ij = ( Σ_{k=1..n} μ_ik^m x_kj ) / ( Σ_{k=1..n} μ_ik^m ),  j = 1, 2, ..., m in m-dimensional space.

For c = 1:  v_11 = [1(1) + 1(1)] / 2 = 1
            v_12 = [5(1) + 1.5(1)] / 2 = 3.25
Therefore v1 = (1, 3.25).

For c = 2:  v_21 = [2(1) + 1.5(1)] / 2 = 1.75
            v_22 = [4(1) + 0.5(1)] / 2 = 2.25
Therefore v2 = (1.75, 2.25).

The distances of each data point from each cluster center
are computed using

d_ik = [ Σ_j (x_kj - v_ij)^2 ]^(1/2)

d_11 = sqrt[ (1-1)^2 + (5-3.25)^2 ] = 1.75
d_12 = sqrt[ (2-1)^2 + (4-3.25)^2 ] = 1.25
d_13 = sqrt[ (1-1)^2 + (1.5-3.25)^2 ] = 1.75
d_14 = sqrt[ (1.5-1)^2 + (0.5-3.25)^2 ] = 2.795

d_21 = sqrt[ (1-1.75)^2 + (5-2.25)^2 ] = 2.85
d_22 = sqrt[ (2-1.75)^2 + (4-2.25)^2 ] = 1.76
d_23 = sqrt[ (1-1.75)^2 + (1.5-2.25)^2 ] = 1.06
d_24 = sqrt[ (1.5-1.75)^2 + (0.5-2.25)^2 ] = 1.76
With these distance measures, we can update U using

μ_ik^(r+1) = [ Σ_{j=1..c} ( d_ik^(r) / d_jk^(r) )^(2/(m-1)) ]^(-1)

For i = 1 we get:

μ_11 = [ (d_11/d_11)^2 + (d_11/d_21)^2 ]^(-1) = [1 + (1.75/2.85)^2]^(-1) = 0.726
μ_12 = [ (d_12/d_12)^2 + (d_12/d_22)^2 ]^(-1) = [1 + (1.25/1.76)^2]^(-1) = 0.665
μ_13 = [ (d_13/d_13)^2 + (d_13/d_23)^2 ]^(-1) = [1 + (1.75/1.06)^2]^(-1) = 0.268
μ_14 = [ (d_14/d_14)^2 + (d_14/d_24)^2 ]^(-1) = [1 + (2.795/1.76)^2]^(-1) = 0.284

Since Σ_i μ_ik = 1 for each k,

U(1) = 0.726   0.665   0.268  0.284
       0.2738  0.3353  0.732  0.716

|| U(1) - U(0) || = 0.27 > ε = 0.05

For the next iteration, we proceed by calculating the
cluster centers again, but now from the new partition U(1),
first for c = 1 and then for c = 2.
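To carry this second example to convergence, one can reuse the fuzzy_c_means sketch given earlier (assuming it is in scope); the exact converged values are not asserted here.

import numpy as np

# The four data points of this example, starting from the first-iteration
# centres computed above.
X = np.array([[1, 5], [2, 4], [1, 1.5], [1.5, 0.5]])
U, V = fuzzy_c_means(X, c=2, m=2, eps=0.05, V0=[[1.0, 3.25], [1.75, 2.25]])
print(np.round(U, 3))   # memberships of x1..x4 in the two clusters
print(np.round(V, 3))   # cluster centres at convergence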
Fuzzy c-Means Clustering

A Commentary:
Limitations and Research Efforts
Limitations of FCM
Hard clustering partitions the dataset into clusters
such that each object belongs to exactly one
cluster. This procedure is unsuitable for real-world
datasets in which class memberships are not strict and
there are no distinct boundaries between the
clusters.

After fuzzy set theory was introduced by Lotfi
A. Zadeh, researchers incorporated the concept of
fuzziness into clustering techniques to deal with the
ambiguity of data. So in this work we focus on fuzzy
clustering among the unsupervised clustering techniques.
Limitations of FCM

In unsupervised clustering, there is no a priori
information on the given data distribution. The
objective of unsupervised fuzzy clustering is to
allocate each data point to all of the different clusters with
some degree of membership. The membership of a
data element thus shares the character of
different clusters, and it depends on the proximity of
the data element to the cluster centres.
Limitations of FCM

The fuzzy c-means (FCM) is the most significant algorithm
in fuzzy clustering; it was proposed by Dunn and
generalized by Bezdek, and is now effectively used
in a wide variety of real-world problems.

The FCM method still has some shortcomings: it
requires a large amount of time to converge, and
it is highly sensitive to noise and outliers in the
data because of the squared norm used to compute similarity
between cluster centers and data elements. To resolve
these problems, many researchers have developed
modified fuzzy c-means algorithms for data analysis, particularly
for image segmentation.
Variants of FCM

A modified entropy-based FCM method has
been developed by Cheng and Wei for data
clustering. A modified FCM which includes
spatial information in the membership function
for clustering was proposed by Chuang et al. to
decrease spurious blobs and remove
noisy spots in images. An adaptive weighted
averaging FCM (AWA-FCM), in which the spatial
influence of the neighbouring pixels on the central
pixel is incorporated, was developed by Jiayin
Kang et al.
Variants of FCM

Though the above algorithms provide good
outcomes in clustering large datasets and in image
segmentation, their performance deteriorates
rapidly when the noise level is amplified.
Since the Euclidean distance is used to calculate
dissimilarity in all of these modified FCMs, they fail to deal with
generally shaped datasets. In addition, their
computational cost is high for large datasets, and they
fail to handle heavy noise and outliers.
The limitations of recent fuzzy c-means algorithms
in data clustering and medical image segmentation
could possibly be taken care of by the method
proposed by Dr. S. R. Kannan (he calls me uncle).
Variants of FCM

Dr. Kannan, with his group, presents three new
effective fuzzy c-means methods based on hyper-
tangent distance and effective additional and penalty
terms, for clustering complex data structures and
breast medical images.
The objective functions of the proposed FCMs
are essentially developed to improve the robustness
of obtaining meaningful clusters, robustness to noise
and outliers, desirable memberships, and the
similarity measurement in real-world datasets.

The research is in progress at his end. I wish him
great success.
