We'll look at three unsupervised clustering methods.
Univariate clustering: evaluates individual variables (raw or scaled) and groups samples into homogeneous classes.
Hierarchical cluster analysis: reduces the multiple variables for a sample to a single distance value, then ranks and links samples based on relative distances.
k-means clustering: groups samples into a set number of classes, using all variables to determine relative distances.
Cluster analysis
The basic assumption with these methods is that measurements made for related samples tend to be similar. Overall, the distance between similar samples is smaller than for unrelated samples.
Univariate clustering.
Iris dataset
Species: I. setosa, I. versicolor, I. virginica
Properties: petal width, petal length, sepal width, sepal length
We'll look at a single property: petal width.
[Figure: histogram of petal width. x-axis: petal width; y-axis: relative frequency.]
Univariate clustering.
The goal is to partition the data so that you have k clusters of data.
[Figure: Histogram (Petal width), iris data. x-axis: petal width; y-axis: relative frequency.]
A simple ranking of the data indicates that we would get reasonable clustering based on petal width.
Really only useful for an initial evaluation of individual variables. Use it only when you have a small number of classes (or potential classes). Its main use is to convert quantitative (continuous) data to ordinal data.
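As a rough illustration of turning a continuous variable into ordinal classes, the sketch below sorts the values and cuts at the k-1 widest gaps. This gap rule is a simple stand-in and not necessarily the algorithm behind the histograms in these slides; the function name and the petal widths are made up.

```python
def gap_clusters(values, k):
    """Split 1-D data into k ordinal classes by cutting at the k-1 widest gaps."""
    ordered = sorted(values)
    # Gap between each pair of neighbours, stored with the index where a cut would fall.
    gaps = [(ordered[i + 1] - ordered[i], i + 1) for i in range(len(ordered) - 1)]
    # Keep the k-1 widest gaps and put their cut positions back in left-to-right order.
    cuts = sorted(pos for _, pos in sorted(gaps, reverse=True)[:k - 1])
    bounds = [0] + cuts + [len(ordered)]
    return [ordered[a:b] for a, b in zip(bounds, bounds[1:])]

# Hypothetical petal widths (arbitrary units), not the real iris measurements.
widths = [2, 3, 2, 4, 13, 15, 14, 12, 22, 24, 23]
classes = gap_clusters(widths, 3)
for rank, members in enumerate(classes, start=1):
    print(rank, members)
```

Each sample can then be replaced by the rank of the class it falls in, which is the continuous-to-ordinal conversion described above.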
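Distance between samples i and j, summed over the variables k:
General form: $d_{ij} = \left[ \sum_{k} \left| x_{ik} - x_{jk} \right|^{M} \right]^{1/M}$
City block (M = 1): $d_{ij} = \sum_{k} \left| x_{ik} - x_{jk} \right|$
Euclidean (M = 2): $d_{ij} = \left[ \sum_{k} \left( x_{ik} - x_{jk} \right)^{2} \right]^{1/2}$
Similarity: $s_{ij} = 1 - \dfrac{d_{ij}}{d_{\max}}$
For similar samples, s_ij approaches 1.
For dissimilar samples, s_ij approaches 0.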
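The distance and similarity definitions can be sketched in a few lines of Python. The function names and the three two-variable samples are invented for illustration, and the code assumes at least two samples differ so that d_max is not zero.

```python
def minkowski(x_i, x_j, m=2):
    """Distance between two samples; m=1 is city block, m=2 is Euclidean."""
    return sum(abs(a - b) ** m for a, b in zip(x_i, x_j)) ** (1.0 / m)

def similarity_matrix(samples, m=2):
    """s_ij = 1 - d_ij / d_max for every pair of samples."""
    n = len(samples)
    d = [[minkowski(samples[i], samples[j], m) for j in range(n)] for i in range(n)]
    d_max = max(max(row) for row in d)  # assumes d_max > 0
    return [[1.0 - d[i][j] / d_max for j in range(n)] for i in range(n)]

samples = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # made-up samples
city_block = minkowski(samples[0], samples[1], m=1)
euclidean = minkowski(samples[0], samples[1], m=2)
s = similarity_matrix(samples)
print(city_block, euclidean)
print(s[0])
```

The two most distant samples always get a similarity of 0, and every sample has a similarity of 1 with itself.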
Clustering
After all our distances or similarities have been calculated, we need a way of determining how closely our samples are related or grouped. We start with the two most related samples and link them - forming an initial cluster. The process is repeated until all samples have been linked.
Clustering
Several methods of linking our samples are available. The three most common are:
Single link
Complete link
Centroid link
Let's start by looking at the simplest method, single link (in two dimensions).
Single link
This approach determines linkage based on the distance to the closest point in a cluster.
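For comparison, the sketch below computes the single-link distance between two small clusters alongside the usual definitions of complete link (farthest pair) and centroid link (distance between cluster means). The points and function names are made up for illustration.

```python
import math

def single_link(a, b):
    """Smallest distance between any point in a and any point in b."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_link(a, b):
    """Largest distance between any point in a and any point in b."""
    return max(math.dist(p, q) for p in a for q in b)

def centroid(cluster):
    """Mean position of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def centroid_link(a, b):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))

# Two hypothetical clusters in two dimensions.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (5.0, 2.0)]
print(single_link(a, b), complete_link(a, b), centroid_link(a, b))
```

The three rules can rank the same pair of clusters quite differently, which is why the choice of linkage changes the resulting dendrogram.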
The two closest points have been linked at a known distance, d_ij.
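Repeating that step until everything is linked gives the full single-link procedure. The following is a minimal, unoptimized sketch; the function name and the four points are assumptions made for the example.

```python
import math

def single_link_merges(points):
    """Link clusters by single linkage; return each merge as (left, right, distance)."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per sample
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 2.0)]  # made-up samples
merges = single_link_merges(points)
for left, right, d in merges:
    print(left, right, d)
```

The recorded merge distances are exactly the heights at which the links would be drawn in a dendrogram.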
HCA dendrograms
After conducting your linkage, you need a way to visualize the results. Dendrograms can be used for this purpose and provide a very simple two-dimensional plot that indicates clustering, similarities and linkages.
Dendrograms
[Figure: dendrogram with a similarity axis running from 1.0 to 0.0.]
We can now see how our samples are linked. The higher the linkage level, the lower the similarity.
Dendrograms
[Figure: dendrogram of samples A to J.]
This plot appears to indicate that there are three groups of samples that can only be linked at very low similarity values.
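One way to read groups off a single-linkage dendrogram is to cut it at a chosen similarity: samples joined by a chain of pairwise similarities at or above the cut end up in the same group. The sketch below assumes a precomputed similarity matrix; the labels, values and function name are invented.

```python
def groups_at_similarity(sim, labels, cut):
    """Single-linkage groups: samples chained together by similarities >= cut."""
    n = len(labels)
    group = list(range(n))  # group[i] is the current group id of sample i
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= cut and group[i] != group[j]:
                # Merge j's whole group into i's group.
                old, new = group[j], group[i]
                group = [new if g == old else g for g in group]
    out = {}
    for label, g in zip(labels, group):
        out.setdefault(g, []).append(label)
    return sorted(out.values())

labels = ["A", "B", "C", "D"]
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(groups_at_similarity(sim, labels, cut=0.5))
print(groups_at_similarity(sim, labels, cut=0.25))
```

Lowering the cut merges groups, which mirrors moving down the similarity axis of the dendrogram.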
Dendrograms
Let's look again at our single linkage example and see what the dendrogram would look like.
[Figure: example dendrogram with a similarity axis running from 1.0 to 0.0.]
A real example
Substances commonly used as accelerants were assayed by capillary column GC/MS. At present, accelerants are identified based on boiling point range.
Class assignments: A, B, C, D, E
Goal: To determine if multivariate data treatment has the potential for classification of accelerants.
Analysis conditions
Neat samples were spiked with a known amount of an internal standard.
SP-5 25 m x 0.2 mm I.D. column
1 µL sample, 100:1 split injection
50 °C for 5 min; 10 °C/min ramp; hold at 250 °C
Total run time: 30 minutes
Mass range: 50-150 AMU
ISTD: octadeuteronaphthalene
Preprocessing of data
A total ion chromatographic profile was extracted and normalized using the internal standard. Triplicate samples were averaged. The first minute of data was discarded due to the presence of a solvent tail. The remaining data was simply summed at one-minute intervals, giving 19 variables.
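The preprocessing steps can be sketched as below. This assumes that "normalized using the internal standard" means dividing each trace by its internal-standard peak area; the function name, its parameters and the tiny traces are hypothetical.

```python
def preprocess(replicates, istd_areas, points_per_minute, skip_minutes=1):
    """Normalize to the internal standard, average replicates, drop the
    solvent-tail minute(s), then sum the trace into one-minute bins."""
    # Assumed normalization: divide every point by that run's ISTD peak area.
    normalized = [[y / area for y in trace]
                  for trace, area in zip(replicates, istd_areas)]
    # Average the replicates point by point.
    mean_trace = [sum(col) / len(col) for col in zip(*normalized)]
    # Discard the first minute(s) of data.
    kept = mean_trace[skip_minutes * points_per_minute:]
    # Sum the remaining points in one-minute intervals.
    return [sum(kept[i:i + points_per_minute])
            for i in range(0, len(kept), points_per_minute)]

# Three made-up replicate traces, 2 points per minute, 3 minutes long.
replicates = [
    [9.0, 9.0, 1.0, 2.0, 3.0, 4.0],
    [18.0, 18.0, 2.0, 4.0, 6.0, 8.0],
    [36.0, 36.0, 4.0, 8.0, 12.0, 16.0],
]
istd_areas = [1.0, 2.0, 4.0]
variables = preprocess(replicates, istd_areas, points_per_minute=2)
print(variables)
```

Each element of the result is one variable for the cluster analysis.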
Classes
A. Light petroleum distillates - petroleum ethers, lighter fluid, naphtha, camping fuels, ...
B. Gasoline
C. Medium petroleum distillates - paint thinners, mineral spirits, ...
D. Kerosene - #1 fuel oil, jet A fuel, ...
E. Heavy petroleum distillates - #2 fuel oil, diesel fuel, ...
Production of dendrograms
Both raw and autoscaled data were processed and dendrograms were produced using single linkage. For the autoscaled data, complete and centroid linkages were also evaluated. For the dendrograms, classes are color coded and labeled. The classes were not used in producing the dendrograms.
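Autoscaling is taken here to mean mean-centering each variable and dividing by its standard deviation, which is the usual definition. A minimal sketch follows; the function name and data are made up, and a constant column would need special handling because its standard deviation is zero.

```python
import statistics

def autoscale(rows):
    """Mean-center each column and divide by its sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]  # fails on constant columns
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

# Two hypothetical variables on very different scales.
rows = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = autoscale(rows)
print(scaled)
```

After scaling, both variables contribute equally to the distance calculation regardless of their original units.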
[Figures: dendrograms of the accelerant data, each with a similarity axis and leaves labeled by class (a to e). Recoverable panel titles: "Raw - comparison", "Single, raw", "Complete, Raw", "Centroid, Raw".]
For this example, a centroid link best reflects what we already know about the data.
Iris dataset
A pretty famous data set published by R.A. Fisher, The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7, 179-188 (1936). He measured four physical properties of irises to see if they could be used to classify any of three different species. He used the length and width of the sepal and petal.
Iris dataset
Species: I. setosa, I. versicolor, I. virginica
Properties: petal width, petal length, sepal width, sepal length
150 samples, no missing values.
HCA analysis was conducted on both raw and scaled data. Both single linkage and complete linkage were evaluated.
Iris dataset
[Figures, raw data: scatter plot of petal width against petal length; dendrogram with a dissimilarity axis and leaves labeled by species (1, 2, 3).]
So it should be possible to classify samples. HCA just does not provide as useful a view as we had hoped for.
Iris dataset
So there was useful information in the dataset. HCA was not a good tool here: reducing the four measurements to a single one actually makes things worse. Autoscaling had little or no effect, because the actual numbers were all of a similar range. Moral: just because a method doesn't work does not mean that there is no useful information.
Classification of Mycobacteria
Investigators at the CDC wanted to see if it was possible to identify mycobacteria using pattern recognition of an HPLC analysis of mycolic acids. Mycobacteria include a number of respiratory and non-respiratory pathogens such as M. tuberculosis. C70-C90 α-branched, β-hydroxy mycolic acids were selected as they are known to be in the cell walls of these bacteria.
Classification of Mycobacteria
Eight species were investigated:
M. asiaticum, M. bovis, M. gastri, M. gordonae, M. kansasii, M. marinum, M. szulgai, M. tuberculosis
22 mycolic acids were used for the classification.
175 total samples.
Classification of Mycobacteria
Limitation. Although the paper specified that it was necessary to normalize the data to account for variations in sample size, no standards were provided. I chose to normalize to the total peak area for each sample. This assumes that each species produces about the same amount of total mycolic acid and that the response/concentration ratio is the same for each component.
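Normalizing to total peak area simply divides each peak by the sum of all peaks in that sample, so every sample ends up summing to 1. A minimal sketch with invented peak areas and a hypothetical function name:

```python
def normalize_to_total(peak_areas):
    """Express each peak as a fraction of the sample's total peak area."""
    total = sum(peak_areas)
    return [area / total for area in peak_areas]

# The same made-up profile measured at two different sample sizes.
small_sample = [1.0, 3.0, 4.0]
large_sample = [2.0, 6.0, 8.0]
print(normalize_to_total(small_sample))
print(normalize_to_total(large_sample))
```

Both samples give the same normalized profile, which is the point of the correction: differences in sample size drop out and only the relative pattern remains.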
Single linkage
Single linkage shows some clustering of the samples but is not very useful.
Complete linkage
Complete linkage gives somewhat better results. We'll look at this sample set again later using other tools.
Identification of Coffee
An attempt was made to identify the source of coffee beans.
Sources: Sulawesi, Costa Rica, Ethiopia, Sumatra, Kenya, Colombia
Method: Mass spectral analysis of the headspace of bean samples. An m/e range of 47-99 was used. Six samples were obtained from each source.
Identification of Coffee
The mass spectra represented the sum of spectra for all components present. As is normal with mass spectra, each was normalized to the largest peak. Only raw data was evaluated.
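A summed, base-peak-normalized spectrum might be built as sketched below. Representing the data as a list of scans and scaling the largest peak to 100 are both assumptions made for the example, as are the function name and intensities.

```python
def summed_spectrum(scans, scale=100.0):
    """Sum the intensity at each m/e over all scans, then scale the largest peak to `scale`."""
    totals = [sum(column) for column in zip(*scans)]
    base = max(totals)  # the largest (base) peak
    return [scale * t / base for t in totals]

# Two made-up scans, each with intensities at the same three m/e values.
scans = [[1.0, 4.0, 0.0], [1.0, 4.0, 4.0]]
spectrum = summed_spectrum(scans)
print(spectrum)
```

Each normalized intensity then serves as one variable for the cluster analysis.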
Single linkage
[Figure: single-linkage dendrogram with leaves labeled by source (Sumatra, Colombia, Sulawesi, Costa Rica, Kenya, Ethiopia).]
Complete linkage
[Figure: complete-linkage dendrogram with leaf groups labeled Kenya, Ethiopia, Colombia, Sumatra, Sulawesi, Costa Rica.]
k-means clustering
An iterative method where samples are initially partitioned into k classes and a centroid is calculated for each. It requires quantitative variables, but they can be raw, scaled or PCA based. The positions of all samples are then calculated relative to the centroids, samples are reassigned to new clusters if needed, and the process is repeated. Classification criteria can include the within-class variance, the pooled covariance matrix or the total inertia matrix. The cluster assignments can vary with the initial starting points, so several runs are commonly used to find a consistent solution.
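The loop can be sketched as follows. This version starts from k randomly chosen samples as the initial centroids rather than from an initial partition, and it scores each run by the within-class sum of squares; both are choices made for the example, as are the function names and the data.

```python
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """One k-means run from a random start; returns (labels, within-class sum of squares)."""
    centroids = rng.sample(points, k)  # assumed start: k random samples
    labels = None
    for _ in range(max_iter):
        # Assign every sample to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # no sample changed cluster, so the solution is stable
        labels = new_labels
        # Recalculate each centroid from its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    wss = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, wss

def best_of(points, k, runs=10, seed=0):
    """Repeat from different starting points and keep the tightest solution."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(runs)), key=lambda r: r[1])

# Two made-up, well-separated groups of samples.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, wss = best_of(points, k=2)
print(labels, wss)
```

Which group gets which label number depends on the random start, so only the grouping itself should be compared between runs.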
k-means clustering
[Figure: k-means iteration. Adjust centroids, then retest and repeat.]
Using XLStat
Classification criteria that can be minimized:
Trace. Minimize the within-class variance, giving the most homogeneous clusters. Data should be autoscaled if this is used.
Determinant. Minimize the determinant of the covariance matrix. More appropriate to use with unscaled data but gives less homogeneous clusters.
Wilks' lambda. Normalized version of the determinant approach.
The k-means option provides more clustering [...] Hierarchical Clustering - AHC) will do a k-means analysis, but only the trace method [...] However, AHC has an option to allow the routine to automatically set the number of clusters that appear to exist.
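To make the three criteria concrete, the sketch below evaluates them for a given partition of two-variable data. It uses one common formulation: W is the pooled within-class scatter matrix, T is the total scatter matrix, and Wilks' lambda is det(W) / det(T). Whether XLStat computes them exactly this way is not confirmed here, and the helper names and data are made up.

```python
def mean(points):
    """Centroid of a set of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def scatter(points, center):
    """2x2 sum-of-squares-and-cross-products matrix about a center."""
    sxx = sum((x - center[0]) ** 2 for x, _ in points)
    syy = sum((y - center[1]) ** 2 for _, y in points)
    sxy = sum((x - center[0]) * (y - center[1]) for x, y in points)
    return [[sxx, sxy], [sxy, syy]]

def det(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def criteria(clusters):
    """Return (trace of W, det of W, Wilks' lambda) for a partition."""
    w = [[0.0, 0.0], [0.0, 0.0]]
    for members in clusters:
        s = scatter(members, mean(members))
        w = [[w[i][j] + s[i][j] for j in range(2)] for i in range(2)]
    everything = [p for members in clusters for p in members]
    t = scatter(everything, mean(everything))
    return w[0][0] + w[1][1], det(w), det(w) / det(t)

# Hypothetical partition into two compact, well-separated clusters.
clusters = [
    [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)],
    [(10.0, 10.0), (12.0, 10.0), (10.0, 12.0), (12.0, 12.0)],
]
trace_w, det_w, wilks = criteria(clusters)
print(trace_w, det_w, wilks)
```

A good partition gives small values for all three; Wilks' lambda has the advantage of being scaled by the total scatter, so it falls between 0 and 1.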
[Figure: raw iris data, petal width plotted against petal length.]
Arson dataset.
Here are the final class results from the k-means clustering.
[Figure: "Centroidal, Scaled" dendrogram with a similarity axis and leaves labeled by class (a to e).]
Mycobacteria
This data set was VERY difficult to visualize using a dendrogram.
[Figure: dendrogram of the mycobacteria data, with a vertical axis running from 0 to 1000 and leaves labeled with two-digit numeric codes.]
Mycobacteria - autoclustering.