
Clustering methods

We'll look at three unsupervised clustering methods:

Univariate clustering. Evaluates individual variables (raw or scaled); groups samples into homogeneous classes.

Hierarchical cluster analysis (HCA). Reduces the multiple variables for a sample to a single distance value; ranks and links samples based on relative distances.

k-means clustering. Groups samples into a set number of classes; uses all variables to determine relative distances.

Cluster analysis
The basic assumption behind these methods is that measurements made on related samples tend to be similar. Overall, the distance between similar samples is smaller than the distance between unrelated samples.

Univariate clustering.

Iris dataset
Species: I. setosa, I. versicolor, I. virginica
Properties: petal width, petal length, sepal width, sepal length

We'll look at a single property: petal width.
Univariate clustering creates k homogeneous classes, using the within-class variance as the measure of homogeneity. It can be used to convert a quantitative variable into a discrete ordinal variable. Another use is simply to evaluate whether a variable has any classification-type information.

[Figure: histogram of petal width (relative frequency vs. petal width)]

Univariate clustering.
The goal is to partition the data so that you have k clusters of data.
Iris data

A simple ranking of the data indicates that we would get reasonable clustering based on petal width.

[Figure: histogram of petal width (relative frequency vs. petal width)]
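As a supplement (not from the original slides), here is a minimal sketch of univariate clustering via one-dimensional k-means, assuming scikit-learn and its bundled copy of the iris data; three classes match the three species.

```python
# Minimal sketch: 1-D k-means on petal width (scikit-learn assumed).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data                 # 150 x 4 measurement matrix
petal_width = X[:, [3]]              # petal width column, kept 2-D for the estimator

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(petal_width)
for k in range(3):
    members = petal_width[km.labels_ == k].ravel()
    print(f"class {k}: n = {members.size}, "
          f"{members.min():.1f}-{members.max():.1f} cm")
```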


Body Temp (from exam)

Not exactly the best classification. It does show that there is some skew in the results (more men in class one and more women in class two), and there is a fair amount of overlap.

So what's it good for?

Univariate clustering is really only useful for an initial evaluation of individual variables. You only want to use it when you have a small number of classes (or potential classes). Its main use is to convert quantitative (continuous) data to ordinal data.

HCA: Distance and similarity

The first step in conducting HCA is to determine the distance between your samples or variables.

Distance and similarity

Actual distances between your samples will vary based on the type and number of measurements present. Similarity values are calculated to normalize the data to a standard scale. The two common metrics are special cases of the Minkowski distance with exponent M over the m measurements:

$$d_{ij} = \Big[ \sum_{k=1}^{m} \lvert x_{ik} - x_{jk} \rvert^{M} \Big]^{1/M}$$

City block (M = 1):

$$d_{ij} = \sum_{k=1}^{m} \lvert x_{ik} - x_{jk} \rvert$$

Euclidean (M = 2, most common):

$$d_{ij} = \Big[ \sum_{k=1}^{m} (x_{ik} - x_{jk})^{2} \Big]^{1/2}$$

Similarity:

$$s_{ij} = 1 - \frac{d_{ij}}{d_{\max}}$$

For similar samples, $s_{ij}$ approaches 1; for dissimilar samples, $s_{ij}$ approaches 0.
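A short sketch of these calculations, assuming SciPy; the three-sample matrix is made up for illustration.

```python
# City-block and Euclidean distances plus the similarity transform (SciPy assumed).
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0],            # three samples, two variables
              [1.5, 1.8],
              [8.0, 8.0]])

d_city = squareform(pdist(X, metric="cityblock"))   # M = 1
d_euc  = squareform(pdist(X, metric="euclidean"))   # M = 2

s = 1.0 - d_euc / d_euc.max()        # s_ij -> 1 for similar, 0 for dissimilar
print(np.round(s, 2))
```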

Clustering
After all our distances or similarities have been calculated, we need a way of determining how closely our samples are related or grouped. We start with the two most related samples and link them - forming an initial cluster. The process is repeated until all samples have been linked.

Clustering
Several methods of linking our samples are available. The three most common are single link, complete link, and centroid link. Let's start by looking at the simplest method, single link, in two dimensions.

Single link
This approach determines linkage based on the distance to the closest point in a cluster.

Single link

$$d_{ij \to C} = 0.5\,d_{iC} + 0.5\,d_{jC} - 0.5\,\lvert d_{iC} - d_{jC} \rvert$$

You start by assuming that the two closest points are a cluster. All points are initially compared as pairs, and then the search for links is expanded. Now let's look at an example. Here is our initial data set shown in two dimensions. The example is also relevant to k-means clustering.

Single link

The two closest points have been linked with a known distance, $d_{ij}$. The process continues using the next closest pair, and we soon have a three-member cluster. Let's skip a few steps: now our points have been linked into three clusters. Eventually, all points have been linked.

Other linkage methods

Complete link. Linkage is based on the farthest point in a cluster, which gives a more conservative linkage:

$$d_{ij \to C} = 0.5\,d_{iC} + 0.5\,d_{jC} + 0.5\,\lvert d_{iC} - d_{jC} \rvert$$

Centroid link (Ward's method). Linkage is based on the center of the cluster:

$$d_{ij \to C}^{2} = \frac{n_i\, d_{iC}^{2}}{n_i + n_j} + \frac{n_j\, d_{jC}^{2}}{n_i + n_j} - \frac{n_i n_j\, d_{ij}^{2}}{(n_i + n_j)^{2}}$$
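For reference, a sketch of the three linkage rules using SciPy's hierarchy module; the toy data is random and only stands in for real samples.

```python
# Single, complete, and centroid linkage on toy data (SciPy assumed).
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(10, 2))   # 10 samples, 2 variables

Z_single   = linkage(X, method="single")    # closest point in a cluster
Z_complete = linkage(X, method="complete")  # farthest point in a cluster
Z_centroid = linkage(X, method="centroid")  # cluster centers (Euclidean data)
```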

HCA dendrograms
After conducting your linkage, you need a way of visualizing the results. Dendrograms serve this purpose, providing a very simple two-dimensional plot that indicates clustering, similarities, and linkages.

Dendrograms

[Figure: dendrogram with a similarity axis running from 1.0 down to 0.0] We can now see how our samples are linked. The higher the linkage level, the lower the similarity.

Dendrograms

[Figure: dendrogram of samples A through J] This plot appears to indicate that there are three groups of samples that can only be linked at very low similarity values.

Dendrograms

Let's look again at our single-linkage example and see what the dendrogram would look like.

Example dendrogram

[Figure: dendrogram of the single-link example, similarity axis from 1.0 to 0.0]
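A sketch of drawing a dendrogram like the one above, assuming SciPy and matplotlib; the labels A-J and the random data are illustrative.

```python
# Dendrogram of a single-link clustering (SciPy and matplotlib assumed).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(1).normal(size=(10, 2))   # stand-in samples A-J
dendrogram(linkage(X, method="single"), labels=list("ABCDEFGHIJ"))
plt.ylabel("distance (lower = more similar)")
plt.show()
```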

A real example
Substances commonly used as accelerants were assayed by capillary-column GC/MS. At present, accelerants are identified based on boiling-point range. Class assignments: A, B, C, D, E. Goal: to determine whether multivariate data treatment has the potential for classification of accelerants.

Analysis conditions
Neat samples were spiked with a known amount of an internal standard.

- SP-5 25 m x 0.2 mm I.D. column
- 1 µL sample, 100:1 split injection
- 50 °C, 5 min; 10 °C/min ramp; hold at 250 °C
- Total run time: 30 minutes
- Mass range: 50-150 amu
- ISTD: octadeuteronaphthalene

Preprocessing of data
A total-ion chromatographic profile was extracted and normalized using the internal standard. Triplicate samples were averaged. The first minute of data was discarded due to the presence of a solvent tail. The remaining data was simply summed at one-minute intervals, giving 19 variables.
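A hypothetical sketch of that binning step, assuming one scan per second and that the usable signal spans minutes 1 through 20; the chromatogram here is random stand-in data.

```python
# Sum a normalized TIC profile into one-minute bins (hypothetical layout).
import numpy as np

scans_per_min = 60                                         # assumed scan rate
tic = np.random.default_rng(2).random(30 * scans_per_min)  # 30 min stand-in run

tic = tic / tic.sum()                                # stand-in for ISTD scaling
start, stop = 1 * scans_per_min, 20 * scans_per_min  # drop minute 0 (solvent tail)
variables = tic[start:stop].reshape(19, scans_per_min).sum(axis=1)
print(variables.shape)                               # (19,) variables per sample
```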

Classes
A. Light petroleum distillates: petroleum ethers, lighter fluid, naphtha, camping fuels, ...
B. Gasoline
C. Medium petroleum distillates: paint thinners, mineral spirits, ...
D. Kerosene: #1 fuel oil, jet A fuel, ...
E. Heavy petroleum distillates: #2 fuel oil, diesel fuel, ...

Representative data profiles

[Figure: representative chromatographic profiles for classes A through E]

Production of dendrograms
Both raw and autoscaled data were processed and dendrograms were produced using single linkage. For the autoscaled data, complete and centroid linkages were also evaluated. For the dendrograms, classes are color coded and labeled. The classes were not used in producing the dendrograms.

As can be seen, classes B, C, and D show a significant level of overlap.

Raw data, single linkage

[Figure: dendrogram, single linkage on raw data; leaves labeled by class a-e]

Raw data, complete linkage

[Figure: dendrogram, complete linkage on raw data; leaves labeled by class a-e]

Raw data, centroid linkage

[Figure: dendrogram, centroid linkage on raw data; leaves labeled by class a-e]

Raw data, comparison

[Figure: side-by-side dendrograms for single, complete, and centroid linkage on the raw data]

Centroidal linkage appears to give the best results.

Autoscaled data, single linkage

[Figure: dendrogram, single linkage on autoscaled data; leaves labeled by class a-e]

Autoscaled data, complete linkage

[Figure: dendrogram, complete linkage on autoscaled data; leaves labeled by class a-e]

Autoscaled data, centroid linkage

[Figure: dendrogram, centroid linkage on autoscaled data; leaves labeled by class a-e]

For this example, a centroid link best reflects what we already know about the data.
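The comparison above can be scripted; here is a sketch assuming SciPy, with random stand-in data in place of the 19-variable arson profiles.

```python
# Dendrograms for raw vs. autoscaled data under three linkage rules (SciPy assumed).
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.stats import zscore

X = np.random.default_rng(3).random((46, 19))        # stand-in arson matrix

for data, tag in [(X, "raw"), (zscore(X, axis=0), "autoscaled")]:
    for method in ("single", "complete", "centroid"):
        Z = linkage(data, method=method)
        print(tag, method, "last merge height:", round(float(Z[-1, 2]), 3))
```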

Iris dataset
A rather famous data set published by R. A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, 179-188 (1936). He measured four physical properties of irises to see if they could be used to classify any of three different species: the length and width of the sepal and petal.

Iris dataset
Species: I. setosa, I. versicolor, I. virginica
Properties: petal width, petal length, sepal width, sepal length

150 samples, no missing values. HCA was conducted on both raw and scaled data. Both single linkage and complete linkage were evaluated.

Autoscaled data, centroidal linkage

[Figure: dendrogram of the iris data (dissimilarity axis); class 1 forms a distinct cluster, while classes 2 and 3 are intermixed]

Iris dataset

[Figure: scatter plot of petal width vs. petal length, raw data] One class is distinct, but the other two overlap. So it should be possible to classify samples; HCA just does not provide as useful a view as we had hoped for.

Iris dataset
So there was useful information in the dataset; HCA is just not a good tool here. Reducing the four measurements to a single distance actually made the data worse. Autoscaling had little or no effect, since the actual numbers were all of a similar range. Moral: just because a method doesn't work does not mean that there is no useful information.

Classification of Mycobacteria

Investigators at the CDC wanted to see if it was possible to identify mycobacteria using pattern recognition of an HPLC analysis of mycolic acids. Mycobacteria include a number of respiratory and non-respiratory pathogens such as M. tuberculosis. C70-C90 α-branched, β-hydroxy mycolic acids were selected as they are known to be in the cell walls of these bacteria.

Classication of Mycobacteria
Eight species were investigated: M. asiaticum, M. bovis, M. gastri, M. gordonae, M. kansasii, M. marinum, M. szulgai, and M. tuberculosis. 22 mycolic acids were used for the classification, with 175 total samples.

Classication of Mycobacteria
Limitation: although the paper specified that it was necessary to normalize the data to account for variations in sample size, no standards were provided. I chose to normalize to the total peak area for each sample. This assumes that each species produces about the same amount of total mycolic acid and that the response/concentration ratio is the same for each component.
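A one-step sketch of that normalization choice, with random stand-in values for the 175 x 22 peak-area table.

```python
# Normalize each sample's peak areas to its total area (stand-in data).
import numpy as np

areas = np.random.default_rng(4).random((175, 22))     # samples x mycolic acids
fractions = areas / areas.sum(axis=1, keepdims=True)   # each row now sums to 1
```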

Single linkage
Single linkage shows some clustering of the samples but is not very useful.

Complete linkage

Complete linkage gives somewhat better results. We'll look at this sample set again later using other tools.

Identification of Coffee

An attempt was made to identify the source of coffee beans: Sulawesi, Costa Rica, Ethiopia, Sumatra, Kenya, Colombia. Method: mass spectral analysis of the headspace of bean samples; the m/e range of 47-99 was used. Six samples were obtained from each source.

Identication of Coffee
The mass spectra represented the sum of spectra for all components present. As is normal with mass spectra, each was normalized to the largest peak. Only raw data was evaluated.
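That base-peak normalization is a one-liner; this sketch assumes one summed spectrum per sample over the 53 channels of m/e 47-99.

```python
# Normalize each summed spectrum to its largest peak (stand-in data).
import numpy as np

spectra = np.random.default_rng(5).random((36, 53))        # 6 sources x 6 samples
normalized = spectra / spectra.max(axis=1, keepdims=True)  # base peak = 1
```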

Representative spectra, m/e 47-99

[Figure: representative summed headspace spectra for each source]

Single linkage

[Figure: dendrogram, single linkage; clusters labeled by origin (Sumatra, Colombia, Sulawesi, Costa Rica, Kenya, Ethiopia), with some mixing]

Complete linkage

[Figure: dendrogram, complete linkage; clusters labeled Kenya, Ethiopia, Colombia, Sumatra, Sulawesi, Costa Rica]

So what's it good for?

This is a fast method of initial data exploration. Try all of the options with both raw and scaled data; the plots can be rapidly evaluated. You can also use principal component data, which will be covered in the next unit. When you are ready to move on to other methods of clustering, knowing the best linkage methods will also be useful.

k-means clustering

An iterative method where samples are initially partitioned into k classes and a centroid is calculated for each. It must use quantitative variables, but they can be raw, scaled, or PCA based. The positions of all samples are then calculated relative to the centroids, samples are reassigned to new clusters (if needed), and the process is repeated. Classification criteria can include the within-class variance, the pooled covariance matrix, or the total inertia matrix. The number of clusters and the assignments can vary based on the initial starting points, so several iterations are commonly used to find a stable solution. The loop is sketched in code after the steps below.

k-means clustering

Position initial class centroids

Test class memberships

Adjust centroids

Retest/repeat
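A minimal NumPy sketch of that loop (not XLStat's implementation); it assumes no cluster empties out during the iterations.

```python
# Bare-bones k-means: position, test, adjust, repeat (NumPy only).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial positions
    for _ in range(n_iter):
        # test class memberships: nearest centroid wins
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # adjust centroids to the mean of their members
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):       # retest/repeat until stable
            break
        centroids = new
    return labels, centroids

labels, centers = kmeans(np.random.default_rng(6).random((60, 4)), k=3)
```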

Using XLStat

Classification criteria that can be minimized:

Trace. Minimize the within-class variance, giving the most homogeneous clusters. Data should be autoscaled if this criterion is used.

Determinant. Minimize the determinant of the pooled covariance matrix. More appropriate for unscaled data, but gives less homogeneous clusters.

Wilks lambda. A normalized version of the determinant approach.

Trace/median. The centroid ends up being based on the median rather than the mean, unlike the other approaches. Better when there is subclustering of the data.

Using XLStat

XLStat's version of HCA (Agglomerative Hierarchical Clustering, AHC) will also perform a k-means analysis, but only with the trace method. The k-means option provides more clustering control and is faster because no HCA is conducted. However, AHC has an option to let the routine automatically set the number of clusters that appear to exist.
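To make the trace and determinant criteria concrete, here is a NumPy sketch computing both from the pooled within-class scatter matrix of a candidate clustering; the data and labels are random stand-ins.

```python
# Trace and determinant clustering criteria from within-class scatter (NumPy).
import numpy as np

rng = np.random.default_rng(7)
X = rng.random((30, 4))                       # stand-in samples x variables
labels = rng.integers(0, 3, size=30)          # stand-in cluster assignments

W = np.zeros((X.shape[1], X.shape[1]))        # pooled within-class scatter
for k in np.unique(labels):
    C = X[labels == k] - X[labels == k].mean(axis=0)
    W += C.T @ C

print("trace criterion:", np.trace(W))        # total within-class variance
print("determinant criterion:", np.linalg.det(W))
```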

Iris dataset (again)

[Figure: k-means results on the iris data, shown as petal width vs. petal length scatter plots of the raw data]

Arson dataset

Here are the final class results from the k-means clustering.

[Figure: final k-means class assignments for the arson samples, shown against the centroid-linkage dendrogram of the autoscaled data]

Coffee (a more complete data set)

[Figure: k-means class assignments listed by origin; samples largely group by origin, with some mixing between the Costa Rican and Sulawesi samples and between the Kenya and Colombia samples]


Mycobacteria

This data set was VERY difficult to visualize using a dendrogram.

[Figure: dendrogram of all 175 samples labeled by species code; the leaves are too dense to interpret easily]

Mycobacteria, autoclustering

[Figure: AHC autoclustering results for the mycobacteria data]

So what's it good for?

k-means can be used as a way to subdivide a dataset into related clusters. Clusters are objectively determined based on similarities in multidimensional space. While the results can vary based on the starting point, the effect can be minimized by using multiple starting points and repetitions, as sketched below. The results are easier to see than with HCA, and k-means and HCA complement each other.
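A sketch of that multiple-starting-point safeguard, assuming scikit-learn: n_init reruns the algorithm from different random starts and keeps the fit with the lowest within-class variance (inertia).

```python
# Repeated k-means runs to reduce starting-point sensitivity (scikit-learn assumed).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(8).random((60, 5))           # stand-in data
km = KMeans(n_clusters=3, n_init=25, random_state=0).fit(X)
print(km.inertia_, np.bincount(km.labels_))
```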
