
Institut f. Statistik u. Wahrscheinlichkeitstheorie

Cluster analysis applied to regional geochemical data:
Problems and possibilities

M. Templ, P. Filzmoser, and C. Reimann

Forschungsbericht CS-2006-5
Dezember 2006

1040 Wien, Wiedner Hauptstr. 8-10/107, AUSTRIA
Email: P.Filzmoser@tuwien.ac.at
http://www.statistik.tuwien.ac.at

CLUSTER ANALYSIS APPLIED TO REGIONAL GEOCHEMICAL DATA: PROBLEMS AND POSSIBILITIES

Matthias Templ¹,², Peter Filzmoser¹ and Clemens Reimann³

¹ Department of Statistics and Probability Theory, Vienna University of Technology, Wiedner Hauptstr. 8-10, A-1040 Wien, Austria. Email: P.Filzmoser@tuwien.ac.at, Tel.: +43 1 58801 10733
² Department of Register, Classification and Methodology, Statistics Austria, Guglgasse 13, A-1040 Wien, Austria. Email: Matthias.Templ@statistik.gv.at, Tel.: +43 1 71128 7327
³ Geological Survey of Norway, N-7491 Trondheim, Norway. Email: Clemens.Reimann@ngu.no, Tel.: +47 73 904 321

ABSTRACT
A large regional geochemical data set of O-horizon samples from a 188,000 km² area in the
European Arctic, analysed for 38 chemical elements, pH, electrical conductivity (both in a
water extraction) and loss on ignition (LOI, 480 °C), was used to test the influence of
different variants of cluster analysis on the results obtained. Due to the nature of regional
geochemical data (neither normal nor log-normal, strongly skewed, often multi-modal data
distributions), cluster analysis results usually strongly depend on the clustering algorithm
selected. Deleting or adding just one element (variable) in the input matrix can also
drastically change the results of cluster analysis. Different variants of cluster analysis can lead
to surprisingly different results even when using exactly the same input data. Given that
the selection of elements is often based on the availability of analytical packages (or on detection limits)
rather than on geochemical reasoning, this is a disturbing result. Cluster analysis can be used
to group samples and to develop ideas about the multivariate geochemistry of the data set at
hand. It should not be misused as a statistical "proof" of certain relationships in the data. The
use of cluster analysis as an exploratory data analysis tool requires a powerful program
system, able to present the results in a number of easy-to-grasp graphics. In the context of this
work, such a tool has been developed as a package for the R statistical software.

KEY WORDS: Kola Peninsula, O-horizon, cluster analysis, exploratory data analysis, R

1. INTRODUCTION
The principal aim of cluster analysis is to partition observations into a number of groups. A
good outcome of cluster analysis will result in a number of clusters where the observations
within a cluster are as similar as possible while the differences between the clusters are as
large as possible. Cluster analysis must thus determine the number of classes as well as the
memberships of the observations to the groups. To determine the group membership most
clustering methods use a measure of similarity between the observations. The similarity is
usually expressed by distances between the observations in the p-dimensional space of the
variables.
Cluster analysis was developed in taxonomy. The aim was originally to get away from the
high degree of subjectivity when single taxonomists performed a grouping. Since the
introduction of cluster analysis techniques there has been controversy about its merits (see
Davis, 1973 or Rock, 1988 and references there). It was soon discovered that diverse
techniques can yield different groupings, even when using exactly the same data.
Furthermore the addition (or deletion) of just one variable in a cluster analysis can lead to
completely different results. Workers may thus be tempted to experiment with different
techniques and selections of variables until the results of a cluster analysis fit their
preconceived ideas. Cluster analysis is still a popular technique, in part because, as a
complicated statistical technique, it appears to add a scientific component to a publication.
Readers of papers using cluster analysis should be very aware of these problems. Cluster
analysis can be applied as an "exploratory data analysis tool" to better understand the
multivariate behaviour of a data set. It can, however, never provide a "statistical proof" of a certain
relationship between the variables or observations.
While factor analysis (Reimann et al., 2002) uses the correlation matrix for extracting
common "factors" from a given data set most cluster analysis techniques use distance
measures to assign observations to a number of groups. Correlation coefficients lie between
1 and +1, with 0 indicating linear independence. Distance coefficients lie between 0 and ,
with 0 indicating complete identity (Rock, 1988). The use of correlation coefficients requires
not only a normal, but even a multivariate normal, distribution for all the input data (Reimann
et al., 2002). This condition is almost never fulfilled when working with geochemical data
(Reimann and Filzmoser, 2000). Furthermore, geochemical data are "closed" data
(compositional data expressed in units like wt.-% or mg/kg, summing up to a constant (100,
1000, 1,000,000)) and multivariate statistical methods may thus deliver biased results (Le
Maitre, 1982; Aitchison, 1986). The use of distance coefficients makes a priori no
statistical assumptions about the data (except if the data are categorical), which is theoretically
an ideal situation when working with geochemical data. Distance measures will also be
essential for cluster validation, i.e. measuring the quality of a clustering. In theory, it should
be ideal to first use cluster analysis on a large geochemical data set to extract more
homogeneous data subsets (groups) and to then perform factor analysis or discriminant
analysis on these homogeneous data subsets to study their multivariate data structure.
Especially for data sets with many variables, it has been suggested (e.g. Everitt, 1974) to first
use principal component analysis to reduce the dimensionality of the data and to then perform
cluster analysis on the first few principal components. This approach was criticised because
clusters embedded in a high-dimensional variable space will not be properly represented by a
smaller number of orthogonal components (e.g. Yeung and Ruzzo, 2001).
Clustering methods also exist that are not based on distance measures, like model-based
clustering (Fraley and Raftery, 1998). These techniques usually find the clusters by
optimising a maximum likelihood function. The implicit assumption is that the data points
forming the single clusters are multivariate normally distributed, and the algorithm tries to
estimate the parameters of the normal distributions as well as the membership of each
observation to each cluster.
With geochemical data cluster analysis can be used in two different ways: it can be used to
cluster the variables (e.g. to detect geochemical relations between the variables) and it can be
used to cluster the observations (e.g. to assign soil samples to certain parent materials) to
arrive at more homogeneous data subsets for further data analysis.
Here we will apply a variety of different methods of cluster analysis to geochemical data
from a large regional scale geochemical data set containing 617 observations and 40
variables. The objective of this study was to investigate:
- Can (and should) cluster analysis be applied to such a high-dimensional data set? If so,
what are the prerequisites for applying cluster analysis to such data sets?
- What are the results of cluster analysis when such a massive data set is investigated?
- What is the influence of the actual method used and is there an ideal method for regional
geochemical data?
- Is there an objective way to determine the optimum number of clusters extracted?

- Is an objective decision on the number and choice of elements entered into the cluster
analysis possible?
- Which distance measures are most suitable for distance based clustering methods?
- Is there a graphical way to evaluate the stability of clusters?
- Can objective, reliable, statistically significant results be obtained that can provide proof of a
hypothesis or explain the multivariate relation between the elements or observations, or is
cluster analysis rather an exploratory data analysis tool that should only be used to generate
ideas (the proof needs to come from elsewhere)?
The paper is organised as follows: Section 2 gives a detailed description of the example data
set. Data problems for cluster analysis are discussed in Section 3. Sections 4-8 are devoted to
the methodology of cluster analysis. For rapid cluster analysis and plotting of the results the
package clustTool running under R was developed (see Section 9). All algorithms are thus
easily available via the internet (e.g. at the R project site: www.r-project.org). Results and
possibilities for graphical presentations of the results are shown in Section 10. The final
Section 11 concludes.
2. MATERIAL AND METHODS
THE KOLA PROJECT
From 1992 to 1998 the Geological Surveys of Finland (GTK) and Norway (NGU) and the Central
Kola Expedition (CKE), Russia, carried out a large international multi-media, multi-element
geochemical mapping project, covering 188,000 km² north of the Arctic Circle. The entire
area between 24° and 35.5° E up to the Barents Sea coast (Fig. 1) was sampled during the
summer of 1995. Results of the Kola Ecogeochemistry project are documented on a web
site (http://www.ngu.no/Kola) and in a geochemical atlas (Reimann et al., 1998). One sample
material for the project was the O-horizon developed on top of Podsol profiles, representing
the interplay of pedosphere, atmosphere and biosphere and as such reflecting surface
processes ranging from natural (i.e. input of sea salts, influence of vegetation zones) to
industrial contamination. The average sample density was 1 site per 300 km². Detailed maps
of the geology, quaternary geology, topography, vegetation zones and climatic conditions in
the survey area can be found in Reimann et al. (1998).

Figure 1: General location map of the study area for the Kola Project (Reimann et al., 1998).
Locations named in the text are given.

While the western part of the project area (N-Finland and Norway) is almost pristine, the
Russian part is heavily industrialised. This includes several important mines, e.g. Cu/Ni ores
are mined near Zapoljarnij, Fe ores near Olenegorsk, apatite near Apatity (Fig. 1), and
related mineral processing plants. In terms of environmental impact the most important ore
roasters and smelters, responsible for major emissions of Cu, Ni, Co, V and many other
metals (Reimann et al., 1998), are situated near Zapoljarnij, Nikel and Monchegorsk (see Fig.
1).
SAMPLING, SAMPLE PREPARATION AND ANALYSES
A detailed description of sample site selection criteria and sampling methods is given in Äyräs
and Reimann (1995) and in Reimann et al. (1998). The O-horizon was sampled from the
uppermost 3 cm of the organic horizon (usually litter), avoiding living vegetation, using a
special tool (see Reimann et al., 1998) as a composite sample from a 50 x 50 m area
surrounding a complete podzol soil profile. A field duplicate was taken at every 15th site,
some 100 m distant from the original sample.
Analytical procedures and all analytical results are detailed in Reimann et al. (1998). Quality
control procedures followed the methods suggested in Reimann and Wurzer (1986) and
results are documented in Reimann et al. (1998). The O-horizon was collected at 617 sites. A
summary of the elements of the O-horizon is given in Table 1. The samples were air dried
and sieved to < 2 mm using nylon screening. Carbon, hydrogen and nitrogen were determined
using a CHN-analyser according to ISO standard 10694. Electrical conductivity and pH were
determined in a water extraction. To obtain total element concentrations in the organic
fraction 0.4 g of sample were digested with 10 ml of concentrated nitric acid. This extract was
analysed by inductively coupled plasma atomic emission spectrometry (ICP-AES), inductively
coupled plasma mass spectrometry (ICP-MS) and graphite furnace atomic absorption
spectrometry (GF-AAS) for 36 elements (see Niskavaara, 1995).
Table 1: Elements and summary statistics (minimum (MIN), median (MED), maximum
(MAX) and spread (expressed as the median absolute deviation, MAD)) for the Kola O-horizon
data set used here (from Reimann et al., 1998). In addition, the detection limit (DL) and the
number of samples below detection (expressed in %) are given.
3. POSSIBLE DATA PROBLEMS IN THE CONTEXT OF CLUSTER ANALYSIS
MIXING MAJOR, MINOR AND TRACE ELEMENTS

In multi-element analysis of geological materials one usually deals with elements occurring in
very different concentrations. In rock geochemistry, the chemical elements are divided into
"major", "minor" and "trace" elements. Major elements are measured in % or tens of %,
minor elements are measured in about 1 % amounts, and trace elements are measured in ppm,
or even ppb. This may become a problem in multivariate techniques that consider all
variables simultaneously because the variable with the greatest variance will have the greatest
influence on the outcome. Variance is obviously related to absolute magnitude. As one
consequence, one should not mix variables quoted in different units in one and the same
multivariate analysis (Rock, 1988). Transferring all elements to just one unit (e.g. mg/kg) is
not an easy solution to this problem, as the major elements occur in much greater amounts
than the trace elements. Entering geochemical raw data, including major, minor and trace
elements, into cluster analysis does not make sense because it can be predicted that the minor
and trace elements would have almost no influence on the result. The same even applies to
the major elements: if C (or LOI as a proxy for "organic content") is entered together with the
other major elements it will completely govern the clustering just because of its much greater
concentrations. The data matrix will thus need to be "prepared" for cluster analysis using
appropriate data transformation and standardisation techniques.
DATA OUTLIERS
Regional geochemical data sets practically always contain outliers. The outliers should not
simply be ignored but they have to be accommodated because they contain important
information about data quality and unexpected behaviour in the region of interest. In fact,
finding data outliers that may be indicative of mineralisation (in exploration geochemistry) or
of contamination (in environmental geochemistry) is one of the major aims of geochemical
surveys. Outliers can have a severe influence on cluster analysis, because they can affect
proximity measures and obscure clustering tendencies. Outliers should thus be removed prior
to entering a cluster analysis, or statistical clustering methods capable of handling outliers
should be used. This is rarely done. Finding data outliers is not a trivial task, especially in
high dimensions. One way of identifying such outliers is to compute robust Mahalanobis
distances, i.e. Mahalanobis distances on the basis of robust estimates of location and scatter
(Filzmoser et al., 2005).
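In R, such a computation might look as follows; this is only a minimal sketch, assuming the (transformed) element concentrations are held in a numeric matrix x (a hypothetical name) and using the MCD estimator from the robustbase package:

    ## Sketch: multivariate outlier detection via robust Mahalanobis distances.
    ## `x` is a hypothetical numeric matrix of (transformed) element concentrations.
    library(robustbase)
    mcd <- covMcd(x)                             # robust estimates of location and scatter
    d2  <- mahalanobis(x, mcd$center, mcd$cov)   # squared robust Mahalanobis distances
    outlier <- d2 > qchisq(0.975, df = ncol(x))  # flag observations beyond a chi-square cutoff

The 97.5% chi-square quantile is a common, but by no means the only possible, choice of cutoff.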
CENSORED DATA
There is a further problem that often occurs when working with geochemical data: the
detection limit problem. For some determinations a proportion of all results are below the
lower limit of detection of the analytical method, i.e. the data are censored. For statistical

analysis these results are often set to a value of the detection limit. However, a sizeable
proportion of all data with an identical value can seriously influence any cluster analysis
procedure. For the data set studied here, several variables had more than 25% of the data below
detection. It is very questionable whether such elements should be included at all
in a cluster analysis. Unfortunately, it is often the elements of greatest interest that contain the
highest number of censored data (e.g. Se, see Table 1); the temptation to include these in a
cluster analysis is thus high. Here, all elements with more than 5% of all values below
detection have been omitted from cluster analysis (Be and Se, see Table 1).
DATA TRANSFORMATION AND STANDARDISATION
Cluster analysis in general does not require that the data be normally distributed. However, it
is advisable that heavily skewed data are first transformed to a more symmetric distribution.
If a good cluster structure exists for a variable, we can expect a distribution with two or
more modes. A transformation to more symmetry will preserve the modes but remove large
skewness.
Most geochemical textbooks still claim that for geochemical data a log-transformation is most
suitable. Recently, Reimann and Filzmoser (2000) have shown that very few geochemical
variables will indeed follow a (log-)normal distribution. Each single variable therefore needs
to be considered for transformation individually, and different transformations, with the
Box-Cox transformation (Box and Cox, 1964) being the most universal choice, need to be
considered. The most practical guide in deciding whether and how to transform
should be the data distribution: it should be close to symmetric before entering
cluster analysis. Even Box-Cox transformations of all single variables do not guarantee
symmetry of the resulting multivariate data distribution, but more closeness to symmetry (or
removal of strong skewness) will in general improve the cluster results.
An additional standardisation is needed if the variables show a striking difference in the
amount of variability (see discussion above, major, minor and trace elements). Different
methods, all having advantages and disadvantages, exist to accommodate this requirement.
The most universal method is the z-transformation, which builds on the mean and standard
deviation of the data. When working with geochemical data, a robustified version using the
median and the MAD should be preferred.
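In R, this preparation step could be sketched as follows; ohorizon is a hypothetical data frame of raw element concentrations, and a Box-Cox transformation of each variable could replace the simple log-transformation used here:

    ## Sketch: symmetrise and robustly standardise each variable before clustering.
    ## `ohorizon` is a hypothetical data frame of element concentrations.
    robust_z <- function(v) (v - median(v, na.rm = TRUE)) / mad(v, na.rm = TRUE)
    prep <- as.data.frame(lapply(log10(ohorizon), robust_z))  # log, then robust z-transform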

4. DISTANCE MEASURES

A key issue in most cluster analysis techniques is how best to measure distance between the
observations (or variables). Note that "distance" in cluster analysis has nothing to do with
geographical distance between two observations but is rather a measure of similarity between
observations in the multivariate space defined by the entered variables. Many different
distance measures exist (Bandemer and Näther, 1992). Modern software implementations of
cluster algorithms can accommodate a variety of different distance measures because the
distances rather than the data matrix are taken as input, and the algorithm is applied to the
given input.
For clustering the observations the Euclidean distance or the Manhattan distance is the most
frequent choice. The latter measures the distance parallel to the variable axes, rather than
directly (Euclidean), and the cluster results are sometimes more stable. Usually both distance
measures lead to comparable results. Other distance measures like the Gower distance
(Gower, 1966), the Canberra distance (Lance and Williams, 1966), correlation based distance
measures or a distance measure based on the random forest proximity measure (Breiman,
2001) can give completely different cluster results.
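Most of these distance measures are directly available in R; a small sketch, with prep denoting the hypothetical transformed and standardised data matrix from above:

    ## Sketch: distance matrices between observations for subsequent clustering.
    d_euc <- dist(prep, method = "euclidean")  # straight-line distance
    d_man <- dist(prep, method = "manhattan")  # distance parallel to the variable axes
    d_can <- dist(prep, method = "canberra")   # Canberra distance
    library(cluster)
    d_gow <- daisy(prep, metric = "gower")     # Gower distance (also handles mixed variable types)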
To demonstrate the effect of the distance measure used for clustering geochemical data, the
average linkage clustering algorithm (see below) was applied to the Kola O-horizon data,
using all 40 log-transformed and standardised variables with less than 5% of values below the
detection limit. The results were retained for a fixed number of clusters (here 6 clusters were
always sought) for reasons of comparability. It is desirable that a similar data set will give
approximately the same cluster result. Therefore,
1. Bootstrap samples from the original data (samples with replacement) are drawn;
2. The bootstrap samples are clustered with the same method, and the same number of
clusters is extracted;
3. The results are compared to the cluster results obtained from the original data, using
the adjusted Rand index (Hubert and Arabie, 1985) as a measure of similarity (a code sketch of this procedure is given below).
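A minimal sketch of this stability check, assuming average linkage clustering of the hypothetical prepared data matrix prep and using the adjusted Rand index as implemented in the R package mclust:

    ## Sketch: cluster stability via bootstrap samples and the adjusted Rand index.
    library(mclust)  # provides adjustedRandIndex()
    k  <- 6
    cl <- cutree(hclust(dist(prep), method = "average"), k = k)
    ari <- replicate(100, {
      idx <- sample(nrow(prep), replace = TRUE)             # draw a bootstrap sample
      hb  <- hclust(dist(prep[idx, ]), method = "average")  # recluster it
      adjustedRandIndex(cutree(hb, k = k), cl[idx])         # compare the two partitions
    })
    boxplot(ari, ylab = "Adjusted Rand index")              # cf. Figure 2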
Figure 2 shows boxplots of the resulting Rand index of the clustered bootstrap samples. A
high value of the Rand index indicates very similar results, a low value means completely
different results. For this example, the Euclidean, Manhattan, Gower and Canberra distances
lead to stable cluster results whereas the random forest distance yields highly unstable
clusters.

Figure 2: Average linkage clustering for 40 variables of the O-horizon data (log-transformed,
standardised). The cluster results for different methods and a fixed number of clusters are
compared with the corresponding results for bootstrap samples of the data using the Rand
index. The boxplots show the resulting Rand indices.
Similar simulations were also undertaken for different numbers of clusters, for other
clustering algorithms, and for other distance measures. The conclusion was that Euclidean
and Manhattan distance measures gave the most stable clusters.

5. CLUSTERING OBSERVATIONS
One of the main problems with cluster analysis is that a multitude of different clustering
methods exists. The observations need to be grouped into classes (clusters). If each
observation is allocated to only one (of several possible) cluster(s), this is called
"partitioning". Partitioning will result in a pre-defined (user-defined) number of clusters. It is
also possible to construct a hierarchy of partitions, i.e. to group the observations into 1 to n
clusters (n = number of observations). This is called hierarchical clustering. Hierarchical
clustering always delivers n cluster solutions, and based on these solutions the user has to
decide which result is most appropriate.
Two fundamentally different procedures exist. An observation can be allocated to just one cluster
(hard clustering) or be distributed among several clusters (fuzzy clustering). Fuzzy clustering
allows one observation to belong, to a certain degree, to several groups. In terms of applied
geochemistry this procedure will often deliver the more interesting results because it reveals
if one observation is influenced by several factors. The cluster solution will then show to
what degree the observations are influenced by the different factors. Here the factors or
processes are represented by observations that are clustered together in the data space.
HIERARCHICAL METHODS
Input to most hierarchical clustering algorithms is a distance matrix (distances between the
observations). The widely used agglomerative techniques start with single-object clusters
(each observation forms its own cluster) and enlarge the clusters stepwise. The
computationally more intensive reverse procedure starts with one cluster containing all
observations and splits the groups step by step. This procedure is called divisive clustering.

At the beginning of an agglomerative algorithm each observation forms its own class, leading
to n single object clusters. The number of clusters is reduced by one by combining (linking)
the most similar classes at each step of the algorithm. The similarity of the combined pair, now a
new class, to all other classes is then measured, the next two most similar classes are linked,
and so on. At the end of the process there is only one single cluster left, containing all
observations. A number of different methods are available for linking two clusters. Best
known are average linkage, complete linkage and single linkage. The average linkage method
considers the averages of all pairs of distances between the observations of two clusters. The
two clusters with the minimum average distance are combined into one new cluster.
Complete linkage looks for the maximum distance between the observations of two clusters.
The clusters with the smallest maximum distance are combined. Single linkage considers the
minimum distance between all observations of two clusters. The clusters with the smallest
minimum distance are linked. Single linkage will result in cluster chains because for linkage
it is sufficient that only two objects of different clusters are close together. Complete linkage
will result in very homogeneous clusters in the first stages of agglomeration; however, the
resulting clusters will be small. Average linkage is a compromise between the two other
methods and usually performs best in typical applied geosciences applications.
Because the cluster solutions grow tree-like (starting with the roots and ending upwards with
the trunk), results are often displayed in a graphic called the dendrogram (see Figs. 4 and 5 for
clustering variables). Horizontal lines indicate the linkage of two objects or clusters, and the
vertical axis presents the associated height or similarity as a measure of distance. The
objects are arranged in such a way that the branches of the tree do not overlap. Linking of two
groups at a large height indicates strong dissimilarity (and vice versa). Therefore, a clear
cluster structure would be indicated if observations are linked at a very low height, and the
distinct clusters are linked at a considerably higher value (long roots of the tree). Cutting the
dendrogram at the height corresponding to this visible number of clusters allows assigning
the objects to the clusters. Visual inspection of a dendrogram is often helpful in obtaining an
initial idea of the number of clusters to be generated by a partitioning method.
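As an illustration, an average linkage clustering and its dendrogram can be obtained in R as follows (a sketch under the same assumptions as above):

    ## Sketch: agglomerative clustering with average linkage.
    hc <- hclust(dist(prep), method = "average")  # default dist() is Euclidean
    plot(hc, labels = FALSE)   # dendrogram; linkage heights on the vertical axis
    groups <- cutree(hc, k = 6)  # cut the tree at the visually chosen number of clusters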
PARTITIONING METHODS
In contrast to hierarchical clustering methods, partitioning methods require that the number of
resulting clusters be pre-determined. As noted above, when nothing is known about the
observations it can be useful to first carry out a hierarchical clustering. The other possibility is
to partition the data into different numbers of clusters and evaluate the results (see below).
For regionalised data a more subjective but still reasonable approach to evaluation is to


visually inspect the location of the resulting clusters in a map. This exploratory approach can
often reveal interesting data structures.
A very popular partitioning algorithm is the k-means algorithm. It attempts to minimise the
average squared distance between the observations and their cluster centres or centroids.
Starting from k initial cluster centroids (e.g. random initialisation by k observations), the
algorithm assigns the observations to their closest centroids (using e.g. Euclidean distances),
recomputes the cluster centres, and iteratively reallocates the data points to the closest
centroid. Several algorithms exist for this purpose; those of Hartigan (1975) and MacQueen
(1967) are the most popular. There are also some modifications of the k-means algorithm:
Manhattan distances are used for k-medians, and the centroids are the medians of each cluster.
Hard competitive learning works by randomly drawing an observation from the data and
moving the closest centre towards that point (e.g., Ripley, 1996). Martinetz et al. (1993) have
introduced "neuralgas", this method is similar to hard competitive learning, but in addition to
the closest centroid also the second closest centroid is moved at each iteration. A new high
extensible toolbox for centroid clustering was recently implemented in R (Leisch, 2006).
Here the user can easily try out almost any arbitrary distance measure and centroid
computations for data partitioning.
Kaufman and Rousseeuw (1990) proposed several clustering methods which are
implemented in a number of software packages. The partitioning method PAM (partitioning
around medoids) minimises the average distance to the cluster medoids. It is thus similar to
the k-medians method but allows the use of different distance measures. A similar method
called CLARA (Clustering Large Applications) is based on random sampling. It saves
computation time and is particularly appropriate for larger data sets.
The result of all these algorithms depends on the initial cluster centres, which are often a
random selection of k of the observations. If bad initial cluster centres are selected, the
iterative partitioning algorithms can lead to a local optimum that can be far away from the
global optimum. This can be avoided by applying the algorithms with different random
initialisations, and then selecting the best (according to a validity measure, see below) or most
stable result.
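The k-means implementation in R supports multiple random initialisations directly; a sketch:

    ## Sketch: k-means with 25 random initialisations; the solution with the
    ## smallest total within-cluster sum of squares is kept automatically.
    km <- kmeans(prep, centers = 6, nstart = 25)
    table(km$cluster)  # sizes of the resulting clusters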
Another way to approximate the global optimum is bootstrap aggregation, called bagging
(Breiman, 1996). This bootstrap method generates new data sets of the same size as the
available data set by a random selection of observations with replacement. The


central idea of the bagged clustering algorithm bclust (Leisch, 1998, 1999) is to repeatedly
apply a clustering algorithm (e.g. k-means) on bootstrap data sets, combine the resulting
centroids to a new data set, run a hierarchical clustering algorithm on this new data set and cut
the resulting dendrogram to get a partition into k clusters. The observations are then assigned
to the closest centre.
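An implementation of bagged clustering is available, for instance, as bclust() in the R package e1071; a minimal sketch, again on the hypothetical matrix prep:

    ## Sketch: bagged clustering (k-means on bootstrap samples, followed by
    ## hierarchical clustering of the collected centroids).
    library(e1071)
    bc <- bclust(as.matrix(prep), centers = 6, iter.base = 50)  # 50 bootstrap k-means runs
    table(bc$cluster)                                           # final partition into 6 clusters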
MODEL-BASED METHODS
A method that is not based on distances between the observations but on certain models
describing the shape of the clusters is called model-based clustering (Fraley and Raftery,
2002). The Mclust algorithm selects the cluster models (e.g. elliptical cluster shape) and the
number of clusters and determines the cluster memberships of all observations. The
estimation is achieved using the Expectation-Maximization (EM) algorithm (Dempster et al,
1977). The EM algorithm is executed for several numbers of clusters and with several sets of
constraints on the covariance matrices of the clusters. Finally, the combination of model and
number of groups that leads to the highest BIC (Bayesian Information Criterion) value can be
chosen as the optimal model (Fraley and Raftery, 1998). The BIC value can also be computed
for each cluster separately.
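Model-based clustering along these lines is provided by the R package mclust; a sketch:

    ## Sketch: model-based clustering; the covariance model and the number
    ## of clusters are selected via the BIC.
    library(mclust)
    mc <- Mclust(prep, G = 2:9)  # try 2 to 9 clusters over several covariance models
    plot(mc, what = "BIC")       # BIC curves for the candidate models
    cl <- mc$classification      # cluster memberships under the best model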
FUZZY METHODS
In fuzzy clustering, the observations are not clearly allocated to one of the clusters, but they
are distributed in certain amounts among all clusters. Thus, for each observation a
membership coefficient to all clusters is determined, providing information on how strongly the
observation is associated with each cluster. The membership coefficients are usually
transformed to the interval [0,1], and they can be visualised for example by using a grey
scale. A popular fuzzy clustering algorithm is the fuzzy c-means (FCM) algorithm, developed
by Dunn (1973) and improved by Bezdek (1981), which calculates the prototypes (the most
typical group characteristics) of the clusters and the membership coefficients of each observation
to the clusters. Another fuzzy clustering algorithm is the Gustafson-Kessel (GK) algorithm
(Gustafson and Kessel, 1979). While FCM identifies clusters that tend to be rather spherical,
GK is able to detect elliptically shaped clusters, because the GK algorithm replaces the
Euclidean distance by the Mahalanobis distance. The Gath-Geva (GG) algorithm (Gath and
Geva, 1989), also called the Gaussian mixture decomposition algorithm, is even more
flexible. It is an extension of the GK algorithm which can also deal with different cluster
sizes and densities. The GK and the GG algorithms are freely available at
http://www.fuzzyclustering.de. Just as for partitioning methods, the number of clusters
resulting from fuzzy clustering needs to be chosen by the user.
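FCM is available, for example, as cmeans() in the R package e1071; a sketch:

    ## Sketch: fuzzy c-means; each observation receives a membership
    ## coefficient in [0, 1] for every cluster.
    library(e1071)
    fc <- cmeans(as.matrix(prep), centers = 4, m = 2)  # fuzzifier m = 2 is a common default
    head(round(fc$membership, 2))                      # membership coefficients
    crisp <- fc$cluster                                # hard assignment: largest membership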


6. CLUSTERING VARIABLES
Instead of clustering the observations, it is also possible to cluster the variables in order to find
groups of variables that show similar behaviour. All of the methods discussed above can be
used for clustering the variables. One of the best ways to display the results of clustering
variables is the dendrogram (see Figs. 4 and 5), calling for hierarchical clustering.
7. EVALUATION OF CLUSTER VALIDITY
Because there is no universal definition of clustering, there is no universal measure with
which to compare clustering results. However, evaluating cluster quality is essential since any
clustering algorithm will produce some clustering for every dataset. Validity measures should
support the decision as to the number of natural clusters, and they should also be helpful for
evaluating the quality of the individual clusters. Therefore, validity measures should provide
a value for each single cluster, and they should also return a value for judging the quality of
the overall clustering result.
As mentioned above, a rather simple method of evaluating the quality of a clustering for
regionalised data is to check the distribution of the resulting clusters on a map. The
distribution of the clusters can then be evaluated against known properties of the survey area.
Clusters that form geographically homogeneous subgroups are more likely to have a
meaning than clusters resulting in "geographical noise".
There are many different statistical cluster validity measures. Two different concepts of
validity criteria, external and internal criteria, need to be considered.
External criteria compare the partition found by clustering with a partition that is known a
priori. The most popular external cluster validity indices are the Rand, Jaccard, Fowlkes-
Mallows, and Hubert indices (see e.g., Gordon, 1999, or Halkidi et al., 2002).
Internal criteria are cluster validity measures which evaluate the clustering result of an
algorithm by using only quantities and features inherent in the data set. Most of the internal
validity criteria are based on the within-cluster and between-cluster sums of
squares. Well-known indices are the Calinski-Harabasz index (Calinski and Harabasz, 1974),


Hartigan's indices (Hartigan, 1975), or the Average Silhouette Width of Kaufman and
Rousseeuw (1990).
From a practical point of view, an optimal value of the validity measure does not imply that
the resulting clusters are meaningful. Some of these criteria evaluate only the allocation of the
data to the clusters. Other criteria evaluate the form of the clusters or how well the clusters
are separated. The resulting clusters only correspond to the best partition according to the
validity measure selected. The measures deliver good results when a very clear cluster
structure exists in the data. Unfortunately, when working with geochemical data such good
clusters are rare. Thus cluster quality measures fail time and again when working with such
data and the best approach to evaluating cluster quality is often by just looking at the results
in a map.
Instead of visually inspecting the cluster results on a map, the resulting clusters could be
evaluated according to the geographical coordinates. Since the validity measure would be
optimised for compact clusters, the compactness of the clusters in the map would be judged.
In our experience, however, this approach mostly selects the same optimal number of
clusters as validity measures calculated on the data used for clustering. Nevertheless, when
validity measures are calculated on the geographical coordinates, the choice of the optimal
number of clusters is easier in most cases, because the behaviour of the validity measures
is more distinctive.
Figure 3 shows a plot of validity measures resulting from clustering the log-transformed and
scaled O-horizon data (40 variables). A variety of clustering algorithms with different
distance measures were applied. The evaluation of the performance of the different
algorithms was made with the simplest validity measure, namely the average within-
cluster sum of squares divided by the average between-cluster sum of squares. Figure 3 shows
plots of this measure against the number of clusters. Small values are preferable because they
indicate homogeneous and well-separated clusters. Typically, the optimal number of clusters
is indicated by a knee in the plot: one should select the cluster number just before the
validity measure increases significantly. For example, according to Figure 3 the optimal
number of clusters for k-means clustering using the Euclidean distance is 5. Note that the
algorithm Mclust (Fig. 3, right) is not distance based; therefore only one graph can be plotted.
Figure 3: Resulting validity measures of clustering the 40 variables of the O-horizon data
(log-transformed, standardised) with different methods and based on different distance


measures (rf is the random forest distance). The validity measure changes with varying
number of clusters. The optimal number of clusters is indicated by a knee before or after the
measure increases significantly.
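Curves of the kind shown in Figure 3 can be produced along the following lines; this is a sketch only, and the exact normalisation of the within/between sum-of-squares ratio may differ from the one used in the figure:

    ## Sketch: a simple validity measure (within- to between-cluster sum of
    ## squares) plotted against the number of clusters.
    ks <- 2:10
    ratio <- sapply(ks, function(k) {
      km <- kmeans(prep, centers = k, nstart = 20)
      km$tot.withinss / km$betweenss
    })
    plot(ks, ratio, type = "b", xlab = "Number of clusters", ylab = "Validity measure")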
It is obvious that the graphs in Figure 3 do not give an unequivocal answer on the optimal
number of clusters. Even for a single clustering algorithm the optimal number of clusters
strongly depends on the distance measure chosen. When inspecting the resulting clusters on a
map, interesting structures often appear, even in instances where a number of clusters other than
the indicated optimum was chosen. This somewhat disappointing result demonstrates that cluster
analysis should only be used in an exploratory manner, by varying parameters in the
algorithm and looking at the results on maps.

8. SELECTION OF VARIABLES FOR CLUSTER ANALYSIS


When using a multivariate technique, variable selection is often employed in order to reduce
the dimensionality of a data set or to learn something about the internal structure between the
variables and/or observations. Often it may appear desirable to perform cluster analysis with
all available observations and variables. However, the addition of only one or two irrelevant
variables can have drastic consequences in identifying the clusters. The inclusion of only one
irrelevant variable may be enough to hide the real clustering in the data (Gordon, 1999). The
selection of the variables to enter into a cluster analysis is thus of considerable importance
when working with applied earth science data sets containing a multitude of variables.
Another reason for variable selection may be a desire to focus the analysis on a specific
geochemical process. Such a process is usually expressed by a combination of variables, and
using these variables for clustering permits identifying those observations or areas where the
process is either present or absent. The variables could simply be chosen based on
expert knowledge. It is also possible to apply variable clustering (see above) and select
variables which are in close relation (one branch of the cluster tree), either to highlight a certain
process or to study it in more detail.
Variable clustering can of course also be used to select single key variables from each
important cluster to simply reduce dimensionality for clustering observations.
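Technically, clustering the variables only requires computing the distance matrix between the columns instead of the rows of the data matrix; a sketch:

    ## Sketch: clustering the variables instead of the observations.
    dv <- dist(t(as.matrix(prep)), method = "manhattan")  # distances between variables
    hv <- hclust(dv, method = "average")
    plot(hv)                                              # dendrogram of the variables (cf. Fig. 4)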


As an example of variable clustering, the 40 variables of the Kola O-horizon data set were
clustered. Following a log-transformation and standardisation of the data, average linkage
based on the Manhattan distances between the variables was applied, resulting in the
dendrogram in Figure 4. Variables that are in close relation form a tree in the dendrogram.
Thus, when focusing on certain processes, variables in a tree can be selected for the further
clustering of observations. For example, the elements Cu and Ni, followed by Co, are the
three main elements emitted by the Russian smelters. Traces of As and Mo are also emitted,
and a sizeable Mo anomaly surrounds Monchegorsk (Reimann et al., 1998). For As and
Mo the emissions from industry appear to be the main process determining their regional
distribution in the survey area, and they thus link to the Cu, Ni and Co branch. Interestingly,
V, which is also an important part of the industrial emissions, links to the Cr and the Al, Fe, Sc,
La, Y, Be, Th, U branch, which signifies the content of minerogenic material in the organic
soil. Here two important processes appear in the survey area: the input of dust from industrial
emissions and the input of geogenic dust. On the regional scale the latter is the more
important process than the anthropogenic impact, even for an element like V. In contrast, the
regional distribution of Mg, B, Na, Ca, Sr and the pH is dominated by the input of sea spray
along the coast of the Barents Sea, while C, H, LOI, S, N, P and Hg are all related to the
amount of organic material in the sample. That they appear linked to the sea spray elements at
a high level may be caused by the fact that the organic material decays more slowly near the
coast due to the wet climate. The Mn, Zn, electrical conductivity, K, Rb branch is primarily a
plant nutrient status indicator. Plants near Monchegorsk and Nikel/Zapoljarnij have a poor
nutrient status and this may be the reason that this branch links to the "contaminants" branch.
The regional distribution of Pb, Bi, Tl, Cd, Ag, Sb, Ba is dominated by the three main
vegetation zones observed in the survey area (subarctic tundra, subarctic birch forest, boreal
forest) and some of these elements are also emitted at the smelters, explaining the location of
this branch near plant nutrients and major contaminants.
Figure 4: Clustering the 40 variables of the O-horizon data (log-transformed, standardised)
using average linkage based on the Euclidean distance between the variables. The result is
shown by a dendrogram. Variables on a tree can be selected for further clustering of the
objects.
However, if single linkage is used instead of average linkage the dendrogram in Figure 5
emerges, which is substantially different from the dendrogram in Figure 4 and would suggest
a number of different relations. Although some of the key results are the same (e.g., the
existence of a contamination (Cu, Ni and Co) and a sea spray (Mg, B, Na, pH) branch),


several elements enter other branches than in Figure 4. The dendrogram displayed in Figure 5
would thus lead to a substantially altered interpretation of the behaviour of some of the
elements. For example, Pb, Bi, Tl and Rb now enter the contamination branch; the result
in the average linkage diagram, in contrast, is backed by the authors' long working experience
with the Kola dataset. If somebody wanted to "prove" that these elements are an important
ingredient of the contamination spectrum of the smelters, the single linkage dendrogram would
probably be used. A reader (or reviewer) of a paper using cluster analysis has practically no
chance to judge whether the authors have just "played" long enough with cluster analysis until
a result emerged that fitted preconceived conceptions. The example demonstrates that cluster
analysis can never be a statistical proof of the existence of relations; it can, however, be used
in a truly exploratory data analysis sense to detect structures in the data that are worth
further investigation.
Figure 5: Clustering the 40 variables of the O-horizon data (log-transformed, standardised)
using single linkage based on the Euclidean distance between the variables. The result is quite
different from Figure 4.
To reduce the number of elements entered into cluster analysis one could then select just one
or two elements from each of the major branches of a dendrogram, or study the elements on
just one branch in more detail.

9. A TOOL FOR THE EXPLORATORY USE OF CLUSTER ANALYSIS


It has been shown above that the cluster results can change dramatically with the choice of the
clustering method, the selected variables, the distance measure, and the number of clusters.
Moreover, depending on the selected validity measure, different solutions result for the
optimal number of clusters. Despite the variety of cluster results, each partition could still be
informative and valuable. The results can give an interesting insight into the multivariate data
structure, even if the validity measure does not suggest the chosen number of clusters is
optimal. Thus, it is desirable to perform cluster analysis in an exploratory context, by
changing the cluster parameters and visually inspecting the results.
For this purpose, a statistical tool has been developed in R (freely available at
http://cran.r-project.org) as the contributed package clustTool. Besides the selection of the data,
an optional background map, the variables and the coordinates, different parameters like the
distance measure, the clustering method, the number of clusters, and the validity measure can
be selected.
Depending on the selection, the clusters are presented in maps (see, e.g., Fig. 7) and plots of
the cluster centres are provided (see, e.g., Fig. 8). Additionally, a summary is provided and
information about the clustering is saved in an object in the R workspace. Figure 6 shows the
main menu of the tool clustTool.
Figure 6: Main menu of the tool clustTool for an exploratory use of cluster analysis. Cluster
results are presented in maps, and plots of the cluster centres are provided.

10. RESULTS AND THEIR GRAPHICAL PRESENTATION


The algorithm Mclust with 6 clusters was applied to 40 elements of the Kola O-horizon data.
The validity measure BIC was used for evaluating each individual cluster. Higher values of
the BIC indicate more informative clusters. Therefore, the BIC value is used for assigning
grey scales to the observations in the maps. Figure 7 shows the resulting clusters in 6 maps.
Cluster 3 shows low BIC values, therefore the observations are visualised as light grey points.
Cluster 5 shows the input of sea spray along the Norwegian coast. Cluster 6 identifies the
core areas of contamination surrounding Monchegorsk and Nikel/Zapoljarnij.
Figure 7: Mclust for the log-transformed and standardised 40 elements of the O-horizon data.
6 clusters were chosen and each cluster is evaluated with the BIC measure, resulting in
different grey scales for the observations in the maps.
In general, not only is the location of the single clusters on the maps of interest, but also the
geochemical composition of the clusters. For this purpose, a plot of the cluster centres is
presented in Figure 8, which aids the interpretation of the processes behind the clusters. The
cluster centre is the element-wise mean of all observations of a cluster. Therefore, for each
cluster all elements used for clustering are presented. In Figure 8 the resulting means for all 6
clusters presented in Figure 7 are horizontally arranged. Since the variables used for
clustering were standardised, they each make the same contribution to the cluster analysis. If
single elements show very high or low means for a cluster, they are highly influential for that
cluster. For example, Figure 8 shows high means of the elements Cu, Ni, Co, As and Mo for
cluster 6, identifying the Russian nickel industry.


Figure 8: Plot of the cluster centres for the cluster analysis presented in Figure 7 (40 variables
of the O-horizon data, log-transformed, standardised). High or low values of the elements
suggest high influence of these elements on the observations in the corresponding cluster.
It can be interesting to carry out cluster analysis with a reduced number of variables, e.g. the
variables on the "sea spray" branch of the dendrogram shown in Figure 4 (Na, B, Mg, Ca, Sr,
pH) in order to better understand their specific influence on the observations. Figure 9 shows
the results of cluster analysis using the Mclust algorithm for 8 clusters. Cluster 7, which is
dominated by B, Mg and Na (Fig. 10), is clearly the sea spray cluster. Cluster 8, which is
dominated by Sr, adds new insight. It plots on top of the alkaline intrusions near Apatity. A
small cluster near Apatity (Cluster 4) is dominated by Ca, Sr and pH. It is probably directly
related to alkaline dust from the processing plant in Apatity. Thus there exist clearly two
different processes that determine the spatial distribution of the elements that were
interpreted as sea spray related: a true "sea spray component" and an "alkaline rock dust
component".
Figure 9: Mclust with 8 clusters for the selected elements B, Ca, Mg, Na, pH and Sr.
Figure 10: Plot of the cluster centres for the results presented in Figure 9. Different chemical
processes are visible in the different clusters.
Carrying out the same cluster analysis with one cluster less results in the maps presented in
Figure 11 and the cluster centres shown in Figure 12. Although a "sea spray cluster" (cluster
6) is again identified, as well as one cluster relating to the alkaline intrusions (cluster 7), the
results are clearly different from the previous example, demonstrating that very different
results can be obtained when changing the number of selected clusters.
Figure 11: Mclust with 7 clusters for the selected elements B, Ca, Mg, Na, pH and Sr. The
resulting clusters are quite different from those in Figure 9.
Figure 12: Plot of the cluster centres for the results presented in Figure 11.

In a final example, the results of fuzzy clustering on selected elements of the Kola O-horizon
data are shown. The variables B, Co, Cu, Mg, Na and Ni, indicative of two of the main processes
in the survey area (sea spray and industry), were log-transformed and standardised. Based on
the Euclidean distances, the FCM algorithm with 4 clusters was applied. The resulting


membership coefficients are shown in grey scales in Figure 13: higher membership of an
observation to a particular cluster is visualised by a darker point in the corresponding map.
The plot in Figure 14 with the cluster centres allows a better understanding of the resulting
clusters: Cluster 1 is a "sea spray" cluster, and Cluster 4 is a contamination cluster. Cluster 2
appears to indicate an outer rim of contamination, while all background observations
accumulate in Cluster 3.
Figure 13: Results of fuzzy clustering (FCM algorithm based on Euclidean distances) with 4
clusters of the elements B, Co, Cu, Mg, Na and Ni of the Kola O-horizon data (log-transformed,
standardised). The membership coefficients of the observations to the clusters
are shown by grey scale in the maps.
Figure 14: This plot with the cluster centres supports an interpretation of the clusters
visualised in Figure 13.
Further clustering results for the Kola Project data as well as for other geochemical data sets
can be found in Templ (2003). This thesis also investigates and demonstrates the sensitivity
of cluster analysis methods to data preparation.

11. CONCLUSIONS
Like many other multivariate statistical methods, cluster analysis can be helpful for obtaining
an overview of data sets with many observations and variables. It can be used both to
structure the variables and to group the observations. Which of the available variables are used
for cluster analysis, and how they are prepared, is crucial to the outcome. Using selected
elements or selected observations (e.g. sub-regions in the map) will in general give very
different results, some of them allowing a clearer, and some a less clear, understanding of the
structure or classification of the data. As a general rule, symmetrisation of the data
distribution (e.g. by using a log-transformation for each variable) and standardisation is a
necessary part of data preparation before applying cluster analysis. Depending on the
clustering method, outliers can heavily affect the results and should be removed from the data
prior to analysis.
It is difficult to give a general recommendation concerning the cluster method to use. In our
case interesting results were obtained with model-based clustering (algorithm Mclust), but


also other, simpler algorithms led to useful interpretations and maps. If observations are
clustered (and not variables), the visualisation of the clusters in maps is most helpful; together
with a plot of the cluster centres, an immediate impression of the cluster characteristics is
provided. These are often far more helpful than plots of validity measures or dendrograms,
which are difficult to read for applications with many observations.
We recommend the use of cluster analysis as an exploratory method. For this purpose, the
software package clustTool that runs under R has been developed. The user can choose data,
coordinates, background maps, variables, different distance measures, various cluster
algorithms, determine the number of clusters, and look at the results in plots. The visual
impression of the results, together with a pre-chosen validity measure, is then helpful for
deciding on the parameter selection for clustering the data. Interesting results are not
necessarily obtained by tuning the parameters for cluster analysis in a statistically optimal
way. Expert knowledge should also be used for this purpose, e.g. for variable selection or
cluster evaluation. A flexible software tool used by experts can combine both strategies.
ACKNOWLEDGEMENT
The authors are grateful to Dr. Robert Garrett (Geological Survey of Canada) for many
stimulating discussions and for his suggestions leading to a significant improvement of an earlier
version of this manuscript.

REFERENCES
Aitchison, J., 1986. The statistical analysis of compositional data. Wiley, New York, 416pp.
Äyräs, M. and Reimann, C., 1995. Joint ecogeochemical mapping and monitoring in the scale
of 1:1 mill. in the West Murmansk Region and contiguous areas of Finland and Norway 1994-1996. Field Manual. Nor. Geol. Unders. Rep. 95.111, 33pp.
Bandemer, H. and Näther, W., 1992. Fuzzy Data Analysis. Kluwer Academic Publishers,
360pp.
Bezdek, J.C., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum
Press, New York, 256pp.


Box, G.E.P. and Cox, D.R., 1964. An analysis of transformations. Journal of the Royal
Statistical Society (B) 26, 211-252.
Breiman, L., 1996. Bagging predictors. Machine Learning 24, 123-140.
Breiman, L., 2001. Random forests. Machine Learning 45 (1), 5-32.
Calinski, T. and Harabasz, J., 1974. A dendrite method for cluster analysis. Communications
in Statistics 3, 1-27.
Davis, J.C., 1973. Statistics and data analysis in geology. John Wiley & Sons, New York,
550pp.
Dempster, A.P., Laird, N.M., and Rubin, D.B., 1977. Maximum likelihood from incomplete
data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B
39, 1-38.
Dunn, J.C., 1973. A Fuzzy Relative of the ISODATA process and its use in detecting
compact well-separated clusters. Journal of Cybernetics 3, 32-57.
Everitt, B., 1974. Cluster Analysis. Heinemann Educational, London, 1974, 248pp.
Filzmoser, P., Reimann, C., and Garrett, R.G., 2005. Multivariate outlier detection in exploration
geochemistry. Computers and Geosciences 31, 579-587.
Fraley, C. and Raftery, A., 1998. How many clusters? Which clustering method? Answers via
model-based cluster analysis. The Computer Journal 41, 578-588.
Fraley, C. and Raftery, A.E., 2002. Model-based clustering, discriminant analysis, and
density estimation. Journal of the American Statistical Association 97, 611-631.
Gath, I. and Geva, A., 1989. Unsupervised optimal fuzzy clustering, IEEE Trans. on Pattern
Analysis and Machine Intelligence 11(7), 773-781.
Gower, J.C., 1966. Some distance properties of latent root and vector methods used in
multivariate analysis. Biometrika 53, 325-338.


Gordon, A.D., 1999. Classification. Chapman & Hall/CRC, Boca Raton, 2nd edition, 256pp.
Gustafson, D.E. and Kessel, W., 1979. Fuzzy clustering with a fuzzy covariance matrix. Proc.
IEEE-CDC 2, 761-766.

Halkidi, M., Batistakis, Y., and Vazirgiannis, M., 2002. Cluster validity methods. SIGMOD
Record 31, 40-45.
Hartigan, J., 1975. Clustering Algorithms. John Wiley and Sons, New York, 351pp.
Hubert, L. and Arabie, P., 1985. Comparing partitions. Journal of Classification 2, 193-218.
Kaufman, L. and Rousseeuw, P.J., 1990. Finding Groups in Data. John Wiley & Sons, Inc.,
New York, 368pp.
Lance, G.N. and Williams, W.T., 1966. Computer programs for classification. Proceedings of
the Australian National Committee on Computing and Automatic Control Conference, Canberra.
Paper 12/3, 304-306.
Le Maitre, R.W., 1982. Numerical Petrology. Elsevier, Amsterdam, 281pp.
Leisch, F., 1998. Ensemble methods for neural clustering and classification. PhD thesis,
Institut für Statistik, Wahrscheinlichkeitstheorie und Versicherungsmathematik, Technische
Universität Wien, Austria, 130pp.
Leisch, F., 1999. Bagged clustering. Working Paper 51, SFB Adaptive Information Systems
and Modeling in Economics and Management Science, Wirtschaftsuniversität Wien,
Austria, 11pp.
Leisch, F., 2006. A toolbox for k-centroids cluster analysis. Computational Statistics and Data
Analysis, 2006. Accepted for publication.
MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations.
In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability,
eds L. M. Le Cam & J. Neyman, 1, Berkeley, CA: University of California Press, 281-297.


Martinetz, T., Berkovich, S., and Schulten, K., 1993. Neural-gas network for vector quantization
and its application to time-series prediction. IEEE Transactions on Neural Networks 4 (4), 558-569.
Niskavaara, H., 1995. A comprehensive scheme of analysis for soils, sediments, humus and
plant samples using inductively coupled plasma atomic emission spectrometry (ICP-AES). In:
Autio, S. (ed.): Geological Survey of Finland, Current Research 1993-1994. Geological
Survey of Finland, Espoo, Special Paper 20, 167-175.
Reimann, C. and Filzmoser, P., 2000. Normal and lognormal data distribution in
geochemistry: death of a myth. Consequences for the statistical treatment of geochemical and
environmental data. Environmental Geology 39(9), 1001-1014.
Reimann, C. and Wurzer, F., 1986. Monitoring accuracy and precision - improvements by
introducing robust and resistant statistics. Mikrochimica Acta 1986 II, No.1-6, 31-42.
Reimann, C., Äyräs, M., Chekushin, V.A., Bogatyrev, I., Boyd, R., Caritat, P. de, Dutter, R.,
Finne, T.E., Halleraker, J.H., Jäger, Ø., Kashulina, G., Niskavaara, H., Lehto, O., Pavlov, V.,
Räisänen, M.L., Strand, T., and Volden, T., 1998. Environmental Geochemical Atlas of the
Central Barents Region. NGU-GTK-CKE special publication. Geological Survey of
Norway, Trondheim, Norway, 745pp.
Reimann, C., Filzmoser, P., and Garrett, R.G., 2002. Factor analysis applied to regional
geochemical data: problems and possibilities. Applied Geochemistry 17, 185-206.
Ripley, B.D., 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, 416pp.
Rock, N.M.S., 1988. Numerical Geology. Lecture Notes in Earth Sciences 18, Springer
Verlag, New York-Berlin-Heidelberg, 427pp.
Templ, M., 2003. Cluster Analysis applied to Geochemical Data. Diploma Thesis, Vienna
University of Technology, Vienna, Austria, 137pp.
Yeung, K. and Ruzzo, W., 2001. An empirical study on principal component analysis
for clustering gene expression data. Bioinformatics 17(9), 763-774.


TABLES
ELEMENT           DL       %<DL    MIN       MED       MAX       MAD
Ag                0.02     0       0.025     0.2       4.79      0.16
Al                0.2      0       372       1890      20600     1201
As                0.05     0       0.364     1.16      43.5      0.46
B                 0.8      0.2     <0.8      2.15      13        0.7
Ba                0.05     0       13.9      76        290       30.3
Be                0.02     25.1    <0.02     0.04      1.87      0.04
Bi                0.02     0       0.029     0.159     1.12      0.08
C                 1000     0       153000    450000    508000    3710
Ca                5        0       460       2960      25400     786
Cd                0.02     0       0.07      0.3       1.39      0.11
Co                0.03     0       0.21      1.57      96        1.11
Cr                0.4      0       0.39      2.91      109       1.75
Cu                0.01     0       2.7       9.7       4080      5.14
Fe                10       0       430       1970      44800     1245
H                 1000     0       22000     61000     71000     444
Hg                0.04     0       0.094     0.227     0.974     0.05
K                 200      0       300       1000      5700      297
La                0.7      4.5     <0.7      2.3       139       1.78
Mg                10       0       240       750       23800     297
Mn                1        0       11.1      126       5470      108
Mo                0.01     0       0.086     0.258     5.45      0.1
N                 1000     0       5000      13000     20000     300
Na                10       3.4     <10       60        2350      29.7
Ni                0.3      0       1.5       9.2       2880      7.74
P                 15       0       192       930       9280      208
Pb                0.04     0       4.1       19        1110      7.41
Rb                0.5      0       0.68      5.8       33        2.76
S                 15       0       400       1530      3830      297
Sb                0.01     0       0.016     0.183     0.962     0.08
Sc                0.1      0.5     <0.1      0.5       4.1       0.3
Se                0.8      88      <0.8      <0.8      7.4       0.4
Si                20       0       290       530       940       74.1
Sr                0.2      0       6.1       29        1430      13.6
Th                0.04     0       0.06      0.35      15.4      0.25
Tl                0.01     0       0.02      0.09      0.56      0.05
U                 0.004    0       0.008     0.099     14.3      0.07
V                 0.02     0       1.1       4.9       49        2.39
Y                 0.1      0       0.2       0.9       69        0.59
Zn                0.4      0       12        46        198       15.1
other parameters
pH                0.1      0       3.2       3.85      5.6       0.22
EC                0.1      0       5.53      13        23        2.92
LOI               0.1      0       33.5      89.8      98.8      6.52

Table 1: Elements and summary statistics (minimum (MIN), median (MED), maximum
(MAX) and spread, expressed as the median absolute deviation (MAD)) for the Kola
O-horizon data set used here (from Reimann et al., 1998). In addition, the detection limit
(DL) and the percentage of samples below the detection limit (%<DL) are given.
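
Such summary statistics are straightforward to reproduce in R, the environment used for all
computations in this work. The following is a minimal sketch only; the function name
summarise_element, the simulated data, and the substitution of values below the detection
limit by half the DL are illustrative assumptions, not part of the original analysis.

    # Minimal sketch: summary statistics as in Table 1 for one element.
    # Values below the detection limit 'dl' are assumed to be stored as NA
    # and are replaced by dl/2 (a common, but crude, convention).
    summarise_element <- function(x, dl) {
      x <- ifelse(is.na(x), dl / 2, x)
      c(MIN = min(x), MED = median(x), MAX = max(x), MAD = mad(x))
    }

    # Example with simulated, strongly right-skewed data:
    set.seed(1)
    cu <- rlnorm(600, meanlog = log(9.7), sdlog = 1)
    round(summarise_element(cu, dl = 0.01), 3)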


FIGURES

Figure 1: General location map of the study area for the Kola Project (Reimann et al., 1998).
Locations named in the text are given.


Figure 2: Average linkage clustering for 40 variables of the O-horizon data (log-transformed,
standardised). The cluster results for different methods and a fixed number of clusters are
compared with the corresponding results for bootstrap samples of the data using the Rand
index. The boxplots show the resulting Rand indices.
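
The computation behind such a stability assessment can be sketched in plain R. The Rand
index function below and the simulated data matrix X are illustrative assumptions; the actual
analysis was carried out with the clustTool package.

    # Minimal sketch: stability of a clustering under bootstrap resampling,
    # measured by the Rand index (simulated placeholder data, not the Kola data).
    rand_index <- function(a, b) {
      tab <- table(a, b)
      n <- length(a)
      (choose(n, 2) + 2 * sum(choose(tab, 2)) -
        sum(choose(rowSums(tab), 2)) - sum(choose(colSums(tab), 2))) /
        choose(n, 2)
    }

    set.seed(42)
    X <- scale(log(matrix(rlnorm(600 * 10), ncol = 10)))  # placeholder data
    k <- 4
    cl <- cutree(hclust(dist(X), method = "average"), k)
    rand <- replicate(100, {
      idx <- sample(nrow(X), replace = TRUE)              # bootstrap sample
      cl_b <- cutree(hclust(dist(X[idx, ]), method = "average"), k)
      rand_index(cl[idx], cl_b)                           # agreement of partitions
    })
    boxplot(rand, ylab = "Rand index")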


Figure 3: Resulting validity measures of clustering the 40 variables of the O-horizon data
(log-transformed, standardised) with different methods and based on different distance
measures (rf is the random forest distance). The validity measure changes with the number
of clusters; the optimal number of clusters is indicated by a knee in the curve, i.e. just before
or after the measure increases sharply.
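
To illustrate how such a validity curve is computed, the sketch below uses the average
silhouette width for k-means solutions with 2 to 10 clusters. The validity measures actually
shown in Figure 3 may differ, and X is again a simulated placeholder for the log-transformed,
standardised data.

    library(cluster)  # provides silhouette()

    # Minimal sketch: a cluster validity measure as a function of the
    # number of clusters k (simulated placeholder data).
    set.seed(42)
    X <- scale(log(matrix(rlnorm(600 * 10), ncol = 10)))
    d <- dist(X)
    avg_sil <- sapply(2:10, function(k) {
      cl <- kmeans(X, centers = k, nstart = 25)$cluster
      mean(silhouette(cl, d)[, "sil_width"])
    })
    plot(2:10, avg_sil, type = "b", xlab = "number of clusters",
         ylab = "average silhouette width")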


Figure 4: Clustering the 40 variables of the O-horizon data (log-transformed, standardised)
using average linkage based on the Euclidean distance between the variables. The result is
shown as a dendrogram. Variables within a branch of the tree can be selected for further
clustering of the observations.
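
Dendrograms of this kind can be produced with base R; note that the variables (columns) are
clustered, so the data matrix is transposed. X is again a simulated placeholder for the
log-transformed, standardised data.

    # Minimal sketch: hierarchical clustering of the variables (columns)
    # based on Euclidean distances (simulated placeholder data).
    set.seed(42)
    X <- scale(log(matrix(rlnorm(600 * 10), ncol = 10)))
    hc_avg <- hclust(dist(t(X)), method = "average")   # as in Figure 4
    plot(hc_avg, main = "Average linkage", xlab = "", sub = "")
    # Single linkage (as in Figure 5) only differs in the method argument:
    hc_single <- hclust(dist(t(X)), method = "single")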


Figure 5: Clustering the 40 variables of the O-horizon data (log-transformed, standardised)
using single linkage based on the Euclidean distance between the variables. The result is quite
different from Figure 4.


Figure 6: Main menu of the tool clustTool for an exploratory use of cluster analysis. Cluster
results are presented in maps, and plots of the cluster centres are provided.


[Figure 7: six maps produced by Mclust; panel labels: BIC = -9289.88 (110 obs),
BIC = -4167.61 (41 obs), BIC = -29646.24 (308 obs), BIC = -4156.59 (30 obs),
BIC = -7838.78 (71 obs), BIC = -6345.8 (57 obs).]

Figure 7: Mclust for the log-transformed and standardised 40 elements of the O-horizon data.
Six clusters were chosen, and each cluster is evaluated with the BIC measure, resulting in
different grey scales for the observations in the maps.
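
Model-based clustering as used for Figures 7, 9 and 11 is available in the R package mclust.
A minimal sketch with a fixed number of six clusters could read as follows; X is again a
simulated placeholder for the data matrix.

    library(mclust)

    # Minimal sketch: model-based clustering with 6 clusters
    # (simulated placeholder data).
    set.seed(42)
    X <- scale(log(matrix(rlnorm(600 * 10), ncol = 10)))
    fit <- Mclust(X, G = 6)
    summary(fit)              # model type, BIC, cluster sizes
    head(fit$classification)  # cluster memberships of the observations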


[Figure 8: plot of the cluster centres; cluster sizes: 110, 41, 308, 30, 71, 57; y-axis: cluster
means; x-axis: cluster number; element symbols plotted at their cluster-mean values.]

Figure 8: Plot of the cluster centres for the cluster analysis presented in Figure 7 (40 variables
of the O-horizon data, log-transformed, standardised). Notably high or low values of an
element indicate a strong influence of this element on the observations in the corresponding
cluster.
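
A basic version of such a cluster-centre plot can be sketched in base R. For brevity the sketch
below derives the memberships with k-means rather than from the Mclust result, and invents
element names; both are illustrative assumptions (Figures 8, 10, 12 and 14 were produced
with clustTool).

    # Minimal sketch: plot cluster centres with element labels
    # (simulated placeholder data, k-means memberships).
    set.seed(42)
    X <- scale(log(matrix(rlnorm(600 * 10), ncol = 10)))
    colnames(X) <- paste0("El", 1:ncol(X))   # invented element names
    cl <- kmeans(X, centers = 6, nstart = 25)$cluster
    centres <- t(sapply(1:6, function(g) colMeans(X[cl == g, , drop = FALSE])))
    plot(0, type = "n", xlim = c(0.5, 6.5), ylim = range(centres),
         xlab = "Cluster number", ylab = "cluster means")
    for (g in 1:6)
      text(rep(g, ncol(X)), centres[g, ], labels = colnames(X), cex = 0.6)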


[Figure 9: eight maps produced by Mclust; panel labels: BIC = -4386.34 (358 obs),
BIC = -157.26 (8 obs), BIC = -419.24 (29 obs), BIC = -66.92 (3 obs), BIC = -405.29 (30 obs),
BIC = -679.37 (62 obs), BIC = -1479.12 (110 obs), BIC = -306.88 (17 obs).]

Figure 9: Mclust with 8 clusters for the selected pollution elements B, Ca, Mg, Na, pH and Sr.


[Figure 10: plot of the cluster centres; visible cluster sizes: 358, 29, 30, 62, 110, 17; y-axis:
cluster means; x-axis: cluster number; element symbols plotted at their cluster-mean values.]

Figure 10: Plot of the cluster centres for the results presented in Figure 9. Different chemical
processes are visible in the different clusters.


[Figure 11: seven maps produced by Mclust; panel labels: BIC = -2005.32 (188 obs),
BIC = -1350.99 (104 obs), BIC = -450.13 (32 obs), BIC = -300.06 (26 obs),
BIC = -1081.65 (145 obs), BIC = -557.52 (68 obs), BIC = -1306.9 (54 obs).]

Figure 11: Mclust with 7 clusters for the selected pollution elements B, Ca, Mg, Na, pH and Sr. The
resulting clusters are quite different from Figure 9.


[Figure 12: plot of the cluster centres; cluster sizes: 104, 32, 26, 145, 68, 54, 188; y-axis:
cluster means; x-axis: cluster number; element symbols plotted at their cluster-mean values.]

Figure 12: Plot of the cluster centres for the results presented in Figure 11.


[Figure 13: four maps (FCM, Euclidean); cluster sizes: 108, 188, 224, 97.]

Figure 13: Results of fuzzy clustering (FCM algorithm based on Euclidean distances) with 4
clusters of the elements B, Co, Cu, Mg, Na and Ni of the Kola O-horizon data
(log-transformed, standardised). The membership coefficients of the observations to the
clusters are shown by grey scale in the maps.
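
Fuzzy c-means clustering is provided, for example, by the cmeans() function in the R package
e1071. The sketch below uses simulated stand-ins for the six selected elements; only the
function calls mirror the analysis, not the data.

    library(e1071)  # provides cmeans()

    # Minimal sketch: fuzzy c-means (FCM) with 4 clusters, Euclidean distances
    # (simulated placeholder data for the six selected elements).
    set.seed(42)
    Xsel <- scale(log(matrix(rlnorm(600 * 6), ncol = 6)))
    colnames(Xsel) <- c("B", "Co", "Cu", "Mg", "Na", "Ni")
    fcm <- cmeans(Xsel, centers = 4, m = 2, dist = "euclidean")
    table(fcm$cluster)    # crisp assignments (largest membership)
    head(fcm$membership)  # membership coefficients, mapped as grey scale in Figure 13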


[Figure 14: plot of the cluster centres; cluster sizes: 108, 188, 224, 97; y-axis: cluster means;
x-axis: cluster number; element symbols plotted at their cluster-mean values.]

Figure 14: This plot of the cluster centres supports the interpretation of the clusters
visualised in Figure 13.

