
Clustering methods

We'll look at three unsupervised clustering methods:

Univariate clustering. Evaluates individual variables (raw or scaled); groups samples into homogeneous classes.

Hierarchical cluster analysis (HCA). Reduces the multiple variables for a sample to a single distance value; ranks and links samples based on relative distances.

k-means clustering. Groups samples into a set number of classes; uses all variables to determine relative distances.

Cluster analysis
The basic assumption behind these methods is that measurements made on related samples tend to be similar. Overall, the distance between similar samples is smaller than the distance between unrelated samples.

Univariate clustering.

Iris dataset
Species: I. setosa, I. versicolor, I. virginica
Properties: petal width, petal length, sepal width, sepal length

We'll look at a single property: petal width.
Univariate clustering creates k homogeneous classes, using the within-class variance as the measure of homogeneity. It can be used to convert a quantitative variable into a discrete ordinal variable. Another use is simply to evaluate whether a variable has any classification-type information.

[Figure: histogram of petal width (relative frequency vs. petal width)]

Univariate clustering.
The goal is to partition the data so that you have k clusters of data.
Iris data

A simple ranking of the data indicates that we would get reasonable clustering based on petal width.

[Figure: histogram of petal width (relative frequency vs. petal width)]
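As a supplement (not from the original slides), here is a minimal sketch of univariate clustering via one-dimensional k-means, assuming scikit-learn and its bundled copy of the iris data; three classes match the three species.

```python
# Minimal sketch: 1-D k-means on petal width (scikit-learn assumed).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data                 # 150 x 4 measurement matrix
petal_width = X[:, [3]]              # petal width column, kept 2-D for the estimator

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(petal_width)
for k in range(3):
    members = petal_width[km.labels_ == k].ravel()
    print(f"class {k}: n = {members.size}, "
          f"{members.min():.1f}-{members.max():.1f} cm")
```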


Body Temp (from exam)

Not exactly the best classification. It does show that there is some skew in the results (more men in class one and more women in class two), and there is a fair amount of overlap.

So what's it good for?

Univariate clustering is really only useful for an initial evaluation of individual variables. You only want to use it when you have a small number of classes (or potential classes). Its main use is to convert quantitative (continuous) data to ordinal data.

HCA: Distance and similarity

The first step in conducting HCA is to determine the distance between your samples or variables.

Distance and similarity

Actual distances between your samples will vary based on the type and number of measurements present. Similarity values are calculated to normalize the data to a standard scale. The two common metrics are special cases of the Minkowski distance with exponent M over the m measurements:

$$d_{ij} = \Big[ \sum_{k=1}^{m} \lvert x_{ik} - x_{jk} \rvert^{M} \Big]^{1/M}$$

City block (M = 1):

$$d_{ij} = \sum_{k=1}^{m} \lvert x_{ik} - x_{jk} \rvert$$

Euclidean (M = 2, most common):

$$d_{ij} = \Big[ \sum_{k=1}^{m} (x_{ik} - x_{jk})^{2} \Big]^{1/2}$$

Similarity:

$$s_{ij} = 1 - \frac{d_{ij}}{d_{\max}}$$

For similar samples, $s_{ij}$ approaches 1; for dissimilar samples, $s_{ij}$ approaches 0.
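A short sketch of these calculations, assuming SciPy; the three-sample matrix is made up for illustration.

```python
# City-block and Euclidean distances plus the similarity transform (SciPy assumed).
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0],            # three samples, two variables
              [1.5, 1.8],
              [8.0, 8.0]])

d_city = squareform(pdist(X, metric="cityblock"))   # M = 1
d_euc  = squareform(pdist(X, metric="euclidean"))   # M = 2

s = 1.0 - d_euc / d_euc.max()        # s_ij -> 1 for similar, 0 for dissimilar
print(np.round(s, 2))
```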

Clustering
After all our distances or similarities have been calculated, we need a way of determining how closely our samples are related or grouped. We start with the two most related samples and link them - forming an initial cluster. The process is repeated until all samples have been linked.

Clustering
Several methods of linking our samples are available. The three most common are single link, complete link, and centroid link. Let's start by looking at the simplest method, single link, in two dimensions.

Single link
This approach determines linkage based on the distance to the closest point in a cluster.

Single link

$$d_{ij \to C} = 0.5\,d_{iC} + 0.5\,d_{jC} - 0.5\,\lvert d_{iC} - d_{jC} \rvert$$

You start by assuming that the two closest points are a cluster. All points are initially compared as pairs, and then the search for links is expanded. Now let's look at an example. Here is our initial data set shown in two dimensions. The example is also relevant to k-means clustering.

Single link

The two closest points have been linked with a known distance, $d_{ij}$. The process continues using the next closest pair, and we soon have a three-member cluster. Let's skip a few steps: now our points have been linked into three clusters. Eventually, all points have been linked.

Other linkage methods

Complete link. Linkage is based on the farthest point in a cluster, which gives a more conservative linkage:

$$d_{ij \to C} = 0.5\,d_{iC} + 0.5\,d_{jC} + 0.5\,\lvert d_{iC} - d_{jC} \rvert$$

Centroid link (Ward's method). Linkage is based on the center of the cluster:

$$d_{ij \to C}^{2} = \frac{n_i\, d_{iC}^{2}}{n_i + n_j} + \frac{n_j\, d_{jC}^{2}}{n_i + n_j} - \frac{n_i n_j\, d_{ij}^{2}}{(n_i + n_j)^{2}}$$
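For reference, a sketch of the three linkage rules using SciPy's hierarchy module; the toy data is random and only stands in for real samples.

```python
# Single, complete, and centroid linkage on toy data (SciPy assumed).
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(10, 2))   # 10 samples, 2 variables

Z_single   = linkage(X, method="single")    # closest point in a cluster
Z_complete = linkage(X, method="complete")  # farthest point in a cluster
Z_centroid = linkage(X, method="centroid")  # cluster centers (Euclidean data)
```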

HCA dendrograms
After conducting your linkage, you need a way of visualizing the results. Dendrograms serve this purpose, providing a very simple two-dimensional plot that indicates clustering, similarities, and linkages.

Dendrograms

[Figure: dendrogram with a similarity axis running from 1.0 down to 0.0] We can now see how our samples are linked. The higher the linkage level, the lower the similarity.

Dendrograms

[Figure: dendrogram of samples A through J] This plot appears to indicate that there are three groups of samples that can only be linked at very low similarity values.

Dendrograms

Let's look again at our single-linkage example and see what the dendrogram would look like.

Example dendrogram

[Figure: dendrogram of the single-link example, similarity axis from 1.0 to 0.0]
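A sketch of drawing a dendrogram like the one above, assuming SciPy and matplotlib; the labels A-J and the random data are illustrative.

```python
# Dendrogram of a single-link clustering (SciPy and matplotlib assumed).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(1).normal(size=(10, 2))   # stand-in samples A-J
dendrogram(linkage(X, method="single"), labels=list("ABCDEFGHIJ"))
plt.ylabel("distance (lower = more similar)")
plt.show()
```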

A real example
Substances commonly used as accelerants were assayed by capillary-column GC/MS. At present, accelerants are identified based on boiling-point range. Class assignments: A, B, C, D, E. Goal: to determine whether multivariate data treatment has the potential for classification of accelerants.

Analysis conditions
Neat samples were spiked with a known amount of an internal standard.

- SP-5 25 m x 0.2 mm I.D. column
- 1 µL sample, 100:1 split injection
- 50 °C, 5 min; 10 °C/min ramp; hold at 250 °C
- Total run time: 30 minutes
- Mass range: 50-150 amu
- ISTD: octadeuteronaphthalene

Preprocessing of data
A total-ion chromatographic profile was extracted and normalized using the internal standard. Triplicate samples were averaged. The first minute of data was discarded due to the presence of a solvent tail. The remaining data was simply summed at one-minute intervals, giving 19 variables.
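A hypothetical sketch of that binning step, assuming one scan per second and that the usable signal spans minutes 1 through 20; the chromatogram here is random stand-in data.

```python
# Sum a normalized TIC profile into one-minute bins (hypothetical layout).
import numpy as np

scans_per_min = 60                                         # assumed scan rate
tic = np.random.default_rng(2).random(30 * scans_per_min)  # 30 min stand-in run

tic = tic / tic.sum()                                # stand-in for ISTD scaling
start, stop = 1 * scans_per_min, 20 * scans_per_min  # drop minute 0 (solvent tail)
variables = tic[start:stop].reshape(19, scans_per_min).sum(axis=1)
print(variables.shape)                               # (19,) variables per sample
```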

Classes
A. Light petroleum distillates: petroleum ethers, lighter fluid, naphtha, camping fuels, ...
B. Gasoline
C. Medium petroleum distillates: paint thinners, mineral spirits, ...
D. Kerosene: #1 fuel oil, jet A fuel, ...
E. Heavy petroleum distillates: #2 fuel oil, diesel fuel, ...

Representative data profiles

[Figure: representative chromatographic profiles for classes A through E]

Production of dendrograms
Both raw and autoscaled data were processed and dendrograms were produced using single linkage. For the autoscaled data, complete and centroid linkages were also evaluated. For the dendrograms, classes are color coded and labeled. The classes were not used in producing the dendrograms.

As can be seen, classes B, C, and D show a significant level of overlap.

Raw data, single linkage

[Figure: dendrogram, single linkage on raw data; leaves labeled by class a-e]

Raw data, complete linkage

[Figure: dendrogram, complete linkage on raw data; leaves labeled by class a-e]

Raw data, centroid linkage

[Figure: dendrogram, centroid linkage on raw data; leaves labeled by class a-e]

Raw data, comparison

[Figure: side-by-side dendrograms for single, complete, and centroid linkage on the raw data]

Centroidal linkage appears to give the best results.

Autoscaled data, single linkage

[Figure: dendrogram, single linkage on autoscaled data; leaves labeled by class a-e]

Autoscaled data, complete linkage

[Figure: dendrogram, complete linkage on autoscaled data; leaves labeled by class a-e]

Autoscaled data, centroid linkage

[Figure: dendrogram, centroid linkage on autoscaled data; leaves labeled by class a-e]

For this example, a centroid link best reflects what we already know about the data.
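The comparison above can be scripted; here is a sketch assuming SciPy, with random stand-in data in place of the 19-variable arson profiles.

```python
# Dendrograms for raw vs. autoscaled data under three linkage rules (SciPy assumed).
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.stats import zscore

X = np.random.default_rng(3).random((46, 19))        # stand-in arson matrix

for data, tag in [(X, "raw"), (zscore(X, axis=0), "autoscaled")]:
    for method in ("single", "complete", "centroid"):
        Z = linkage(data, method=method)
        print(tag, method, "last merge height:", round(float(Z[-1, 2]), 3))
```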

Iris dataset
A rather famous data set published by R. A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, 179-188 (1936). He measured four physical properties of irises to see if they could be used to classify any of three different species: the length and width of the sepal and petal.

Iris dataset
Species: I. setosa, I. versicolor, I. virginica
Properties: petal width, petal length, sepal width, sepal length

150 samples, no missing values. HCA was conducted on both raw and scaled data. Both single linkage and complete linkage were evaluated.

Autoscaled data, centroidal linkage

[Figure: dendrogram of the iris data (dissimilarity axis); class 1 forms a distinct cluster, while classes 2 and 3 are intermixed]

Iris dataset

[Figure: scatter plot of petal width vs. petal length, raw data] One class is distinct, but the other two overlap. So it should be possible to classify samples; HCA just does not provide as useful a view as we had hoped for.

Iris dataset
So there was useful information in the dataset; HCA is just not a good tool here. Reducing the four measurements to a single distance actually made the data worse. Autoscaling had little or no effect, since the actual numbers were all of a similar range. Moral: just because a method doesn't work does not mean that there is no useful information.

Classification of Mycobacteria

Investigators at the CDC wanted to see if it was possible to identify mycobacteria using pattern recognition of an HPLC analysis of mycolic acids. Mycobacteria include a number of respiratory and non-respiratory pathogens such as M. tuberculosis. C70-C90 α-branched, β-hydroxy mycolic acids were selected as they are known to be in the cell walls of these bacteria.

Classication of Mycobacteria
Eight species were investigated: M. asiaticum, M. bovis, M. gastri, M. gordonae, M. kansasii, M. marinum, M. szulgai, and M. tuberculosis. 22 mycolic acids were used for the classification, with 175 total samples.

Classication of Mycobacteria
Limitation: although the paper specified that it was necessary to normalize the data to account for variations in sample size, no standards were provided. I chose to normalize to the total peak area for each sample. This assumes that each species produces about the same amount of total mycolic acid and that the response/concentration ratio is the same for each component.
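A one-step sketch of that normalization choice, with random stand-in values for the 175 x 22 peak-area table.

```python
# Normalize each sample's peak areas to its total area (stand-in data).
import numpy as np

areas = np.random.default_rng(4).random((175, 22))     # samples x mycolic acids
fractions = areas / areas.sum(axis=1, keepdims=True)   # each row now sums to 1
```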

Single linkage
Single linkage shows some clustering of the samples but is not very useful.

Complete linkage

Complete linkage gives somewhat better results. We'll look at this sample set again later using other tools.

Identification of Coffee

An attempt was made to identify the source of coffee beans: Sulawesi, Costa Rica, Ethiopia, Sumatra, Kenya, Colombia. Method: mass spectral analysis of the headspace of bean samples; the m/e range of 47-99 was used. Six samples were obtained from each source.

Identication of Coffee
The mass spectra represented the sum of spectra for all components present. As is normal with mass spectra, each was normalized to the largest peak. Only raw data was evaluated.
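That base-peak normalization is a one-liner; this sketch assumes one summed spectrum per sample over the 53 channels of m/e 47-99.

```python
# Normalize each summed spectrum to its largest peak (stand-in data).
import numpy as np

spectra = np.random.default_rng(5).random((36, 53))        # 6 sources x 6 samples
normalized = spectra / spectra.max(axis=1, keepdims=True)  # base peak = 1
```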

Representative spectra, m/e 47-99

[Figure: representative summed headspace spectra for each source]

Single linkage

[Figure: dendrogram, single linkage; clusters labeled by origin (Sumatra, Colombia, Sulawesi, Costa Rica, Kenya, Ethiopia), with some mixing]

Complete linkage

[Figure: dendrogram, complete linkage; clusters labeled Kenya, Ethiopia, Colombia, Sumatra, Sulawesi, Costa Rica]

So what's it good for?

This is a fast method of initial data exploration. Try all of the options with both raw and scaled data; the plots can be rapidly evaluated. You can also use principal component data, which will be covered in the next unit. When you are ready to move on to other methods of clustering, knowing the best linkage methods will also be useful.

k-means clustering

An iterative method where samples are initially partitioned into k classes and a centroid is calculated for each. It must use quantitative variables, but they can be raw, scaled, or PCA based. The positions of all samples are then calculated relative to the centroids, samples are reassigned to new clusters (if needed), and the process is repeated. Classification criteria can include the within-class variance, the pooled covariance matrix, or the total inertia matrix. The number of clusters and the assignments can vary based on the initial starting points, so several iterations are commonly used to find a stable solution. The loop is sketched in code after the steps below.

k-means clustering

Position initial class centroids

Test class memberships

Adjust centroids

Retest/repeat
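A minimal NumPy sketch of that loop (not XLStat's implementation); it assumes no cluster empties out during the iterations.

```python
# Bare-bones k-means: position, test, adjust, repeat (NumPy only).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial positions
    for _ in range(n_iter):
        # test class memberships: nearest centroid wins
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # adjust centroids to the mean of their members
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):       # retest/repeat until stable
            break
        centroids = new
    return labels, centroids

labels, centers = kmeans(np.random.default_rng(6).random((60, 4)), k=3)
```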

Using XLStat

Classification criteria that can be minimized:

Trace. Minimize the within-class variance, giving the most homogeneous clusters. Data should be autoscaled if this criterion is used.

Determinant. Minimize the determinant of the pooled covariance matrix. More appropriate for unscaled data, but gives less homogeneous clusters.

Wilks lambda. A normalized version of the determinant approach.

Trace/median. The centroid ends up being based on the median rather than the mean, unlike the other approaches. Better when there is subclustering of the data.

Using XLStat

XLStat's version of HCA (Agglomerative Hierarchical Clustering, AHC) will also perform a k-means analysis, but only with the trace method. The k-means option provides more clustering control and is faster because no HCA is conducted. However, AHC has an option to let the routine automatically set the number of clusters that appear to exist.
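To make the trace and determinant criteria concrete, here is a NumPy sketch computing both from the pooled within-class scatter matrix of a candidate clustering; the data and labels are random stand-ins.

```python
# Trace and determinant clustering criteria from within-class scatter (NumPy).
import numpy as np

rng = np.random.default_rng(7)
X = rng.random((30, 4))                       # stand-in samples x variables
labels = rng.integers(0, 3, size=30)          # stand-in cluster assignments

W = np.zeros((X.shape[1], X.shape[1]))        # pooled within-class scatter
for k in np.unique(labels):
    C = X[labels == k] - X[labels == k].mean(axis=0)
    W += C.T @ C

print("trace criterion:", np.trace(W))        # total within-class variance
print("determinant criterion:", np.linalg.det(W))
```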

Iris dataset (again)

[Figure: k-means results on the iris data, shown as petal width vs. petal length scatter plots of the raw data]

Arson dataset

Here are the final class results from the k-means clustering.

[Figure: final k-means class assignments for the arson samples, shown against the centroid-linkage dendrogram of the autoscaled data]

Coffee (a more complete data set)

[Figure: k-means class assignments listed by origin; samples largely group by origin, with some mixing between the Costa Rican and Sulawesi samples and between the Kenya and Colombia samples]


Mycobacteria

This data set was VERY difficult to visualize using a dendrogram.

[Figure: dendrogram of all 175 samples labeled by species code; the leaves are too dense to interpret easily]

Mycobacteria, autoclustering

[Figure: AHC autoclustering results for the mycobacteria data]

So what's it good for?

k-means can be used as a way to subdivide a dataset into related clusters. Clusters are objectively determined based on similarities in multidimensional space. While the results can vary based on the starting point, the effect can be minimized by using multiple starting points and repetitions, as sketched below. The results are easier to see than with HCA, and k-means and HCA complement each other.
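A sketch of that multiple-starting-point safeguard, assuming scikit-learn: n_init reruns the algorithm from different random starts and keeps the fit with the lowest within-class variance (inertia).

```python
# Repeated k-means runs to reduce starting-point sensitivity (scikit-learn assumed).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(8).random((60, 5))           # stand-in data
km = KMeans(n_clusters=3, n_init=25, random_state=0).fit(X)
print(km.inertia_, np.bincount(km.labels_))
```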
