

Multivariate Classification Techniques


M J Adams, RMIT University, Melbourne, VIC, Australia
© 2005, Elsevier Ltd. All Rights Reserved.


Introduction
As instrumental chemical analysis techniques have become more sophisticated, with increasing levels of automation, the number of samples routinely analyzed has grown and the amount of data per sample has increased to the point where it can appear overwhelming. Multivariate classification techniques attempt to make sense of these data by identifying inherent patterns that may provide insight into the structure and form of the data and, hence, of the samples themselves. This general area of study is referred to as pattern recognition, a field rich in applications in analytical science. For example, the identification of the origin of goods and foodstuffs is an important task for the analyst. Given a limited set of possibilities, the problem at hand is one of classification: from a training set of samples of known origin, a classification scheme is developed that can classify unknown, test samples. In the biochemical field, nuclear magnetic resonance (NMR) spectroscopy of biofluids and cells provides a unique insight into changes in metabolism caused by drugs and toxins. Although biological NMR spectra are extremely complex, the essential diagnostic parameters are carried in the overall patterns of the spectra. Metabonomics employs pattern recognition methods to interrogate databases of proton NMR spectra. This approach allows a mathematical classification of toxicity based on disparate types of multidimensional metabolic data, so giving new insights into the modes and biochemical mechanisms of toxicity. This work is of value in the prediction of toxicity in drug development studies.
The aims of pattern recognition are to determine summarizing structure within analytical data, using, for example, exploratory data analysis techniques and cluster analysis, and to identify such patterns (classification) according to their correspondence with previously characterized examples. Many numerical techniques are available to the analytical scientist wishing to interrogate their data, but all have the same starting point. The data are expressed in matrix form, in which each sample, or object, is described by a vector of measurements. This representation leads logically to the concept of a pattern space with as many dimensions as the number of variables in the vector, with each object occupying a single point in that multidimensional space. Furthermore, it is assumed that points representing similar patterns (i.e., similar samples) will tend to cluster in the pattern space. Conversely, samples with dissimilar patterns will lie in different regions of the space (Figure 1).

Figure 1  Three objects, A, B, and C, displayed in the three-dimensional pattern space defined by the variables x1, x2, and x3. Objects A and B are closer to each other, and are therefore considered more similar, than either is to object C.

Data Reduction
It is often the case, particularly in spectrochemical analysis, that the number of variables far exceeds the number of samples. This is not surprising given that a single infrared spectrum, for example, can comprise absorption measurements at several thousand wavelengths. Although there are many statistical techniques available for identifying the major variables (features) responsible for defining the pattern space occupied by a sample, by far the most common technique employed in chemometric analysis of analytical data is the method of principal components analysis (PCA). The method is an important tool for analysts in exploratory data analysis.
PCA involves rotating and transforming the original axes, each representing an original variable, into new axes that lie along the directions of maximum variance of the data. These new axes are orthogonal, i.e., the new variables are uncorrelated. Because of the high correlation that frequently exists between analytically measured variables, it is generally found that the number of new variables needed to describe most of the sample data variance is significantly less than the number of original variables. Thus, PCA provides a means to reduce the dimensionality of the parameter space. In addition, PCA can reveal those variables, or combinations of variables, that describe some inherent structure in the data, and these may be interpreted in chemical or physicochemical terms.
Principal components are linear combinations of the original variables, and the linear combination with the largest variance is the first principal component (Figure 2). Once this is determined, the search proceeds to find a second normalized linear combination that carries most of the remaining variance and is uncorrelated with the first principal component. The procedure is continued, usually until all the principal components have been calculated, and a subset of the principal components is then selected for further analysis and interpretation. The scheme for performing PCA is illustrated in Figure 3.

PCA can often be so effective and efficient in reducing the dimensionality of analytical data that it provides an immediate visual indication of patterns within the data, and it is a commonly employed exploratory technique.

Figure 2  The first principal component, PC1, is a new axis representing the combination of original variables providing the greatest variance in the data.

Figure 3  The original data, X, comprising n objects or samples described by m variables, are converted to a dispersion (covariance or correlation) matrix, C. The eigenvalues, λ, and eigenvectors, L, are extracted from C. A reduced set of eigenvectors, Lred, is selected and the original data are projected into the new, lower-dimensional pattern space, Y = X · Lred.
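As a concrete illustration of the scheme in Figure 3, the following Python sketch (a minimal example using NumPy; the function name and the synthetic data are illustrative assumptions rather than part of the original article) mean-centres the data matrix X, extracts the eigenvectors of its covariance matrix, and projects the objects into the reduced pattern space Y = X · Lred.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project the data onto the leading principal components (Y = Xc . Lred)."""
    Xc = X - X.mean(axis=0)                    # mean-centre each variable
    C = np.cov(Xc, rowvar=False)               # dispersion (covariance) matrix, m x m
    eigvals, eigvecs = np.linalg.eigh(C)       # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # re-order by decreasing variance
    L_red = eigvecs[:, order[:n_components]]   # reduced set of eigenvectors, Lred
    return Xc @ L_red                          # scores in the reduced pattern space

# Illustrative data: 10 objects described by 4 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
print(pca_scores(X).shape)                     # (10, 2)
```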

Unsupervised Pattern Recognition


In many initial, exploratory studies, information concerning the presence of groups of similar samples, or the pattern design of such groups as may be present, is not known. This may be due to a lack of knowledge of the pattern-generating process; indeed, determining the inherent patterns existing within the analytical data may be the principal aim of the study. The purpose of unsupervised pattern recognition is to identify groups, or clusters, of similar objects characterized by a multivariate feature set. No a priori information about the structure of possible groups present is required.

The general aim is to partition a given dataset into homogeneous subsets (clusters) according to the similarities of the points in each subset and their relationship to the elements of other subsets. This approach is referred to as cluster analysis. Quantification of the degree of similarity of objects expressed in pattern space is a key process in cluster analysis.
It is generally accepted without proof that similarity and distance are complementary: objects close together in multidimensional space are more alike than those further apart. Most of the similarity measures used in practice are based on some distance function, and whilst many such functions are referenced in the literature, the most common is the simple Euclidean distance metric. In multidimensional space the Euclidean distance, d(1,2), between two objects is given by

d_{(1,2)} = \left[ \sum_{j=1}^{m} (x_{1,j} - x_{2,j})^{2} \right]^{1/2}    [1]

where x_1 and x_2 are the feature vectors of objects 1 and 2, and m is the number of variables (Figure 4). The Euclidean distance can be calculated for all pairs of objects and the data matrix transformed into a square, symmetric distance matrix.



Figure 4  In multidimensional pattern space the Euclidean distance, d_{(A,B)}, between two objects A and B is given by the square root of the sum of the squares of the differences between the values of the defining variables, Δx_j.
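This distance-matrix construction can be sketched in a few lines of Python (NumPy only; the three objects and their variable values are invented purely for illustration):

```python
import numpy as np

def euclidean_distance_matrix(X):
    """Square, symmetric matrix of pairwise Euclidean distances (eqn [1])."""
    diff = X[:, None, :] - X[None, :, :]       # differences over all object pairs
    return np.sqrt((diff ** 2).sum(axis=2))

# Three objects (A, B, C) described by three variables, as in Figure 1
X = np.array([[1.0, 2.0, 0.5],
              [1.1, 1.9, 0.4],
              [4.0, 0.2, 3.3]])
print(np.round(euclidean_distance_matrix(X), 2))   # A and B are close, C is distant
```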
There are basically two approaches to data clustering: dynamic methods and hierarchical techniques.
Dynamic Clustering

Dynamic clustering employs an iterative algorithm to optimize some clustering criterion function, such as the average affinity of points to a cluster's mean value (the c-means algorithm). During each iteration, a data point is assigned to a cluster on the basis of its closeness to the cluster's center. The cluster centers are recalculated to reflect changes brought about by data point assignments, and the new cluster models are used in the next iteration to reclassify the data. The process is continued until a stable partition is obtained, i.e., until no object is reclassified. The number of expected clusters or groups must be specified before commencing the analysis.

With the c-means algorithm a mean vector is taken as representing a cluster, and for asymmetrically shaped clusters this will be inadequate. The algorithm can readily be generalized to more sophisticated models, so that a cluster is represented not by a single point (such as that described by a mean vector) but rather by a function describing some attribute of the cluster's shape in pattern space.
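The iteration described above can be sketched as follows in Python (NumPy only; the function name, the random initialization, and the stopping test are illustrative choices, not a prescribed implementation of the c-means algorithm):

```python
import numpy as np

def c_means(X, c, max_iter=100, seed=0):
    """Partition X into c clusters by iterative reassignment to the nearest centre."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # distance of every object to every cluster centre
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)          # assign each object to its closest centre
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # stable partition: no object reclassified
        labels = new_labels
        for k in range(c):                     # recalculate the cluster centres
            members = X[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return labels, centers

labels, centers = c_means(np.random.default_rng(1).normal(size=(20, 3)), c=2)
```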

Hierarchical Cluster Analysis

Hierarchical cluster analysis is noniterative, and various implementation methods are commonly encountered in analytical science. Initially, each object in the dataset is considered as a separate cluster. At each subsequent stage of the algorithm the two clusters that are most similar are merged to create a new cluster. The algorithm terminates when all clusters are combined. The number of clusters in the dataset does not need to be known or provided a priori.
Using the original data matrix, a suitable matrix of similarity measures between objects is first constructed (e.g., the Euclidean distance matrix). From this similarity matrix, the most similar objects are combined to produce a new grouped object, and the process is repeated until all objects have been included in a single cluster. The choice of an appropriate similarity metric and the manner in which objects are grouped (or clustered) give rise to many combinations of potential methods. Popular interpoint distances used for clustering are the nearest neighbor, furthest neighbor, and the mean (Figure 5).

Figure 5  The distance between an object and an existing cluster can be defined by its proximity to the cluster's nearest neighbor, d_NN, its center, d_M, or its furthest neighbor, d_FN.
The choice of measure may greatly influence the result of clustering. The nearest neighbor metric will link together quite distinct clusters if there exists a path of closely located points connecting the two clusters. The furthest neighbor algorithm does not suffer from this problem, but it can be very sensitive to outliers and will not detect elongated clusters. The hierarchical structure of clusters provided by the algorithm is represented graphically by a dendrogram (Figure 6). The dendrogram illustrates the cluster merging sequence and the corresponding values of the similarity measure employed. A threshold similarity value, selected by the user, splits the data into the perceived correct number of clusters. Generally, the threshold level should be chosen so that intercluster distances are considerably greater than the intracluster distances.

Figure 6  A dendrogram provides a two-dimensional representation of the similarity between objects according to the distance between an object and a cluster. A threshold level, selected by the user, defines the number of distinct groups in the data.
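A short sketch of this agglomerative procedure using SciPy is shown below; the choice of the 'complete' (furthest neighbor) linkage, the synthetic data, and the threshold value are illustrative assumptions only:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(1).normal(size=(8, 3))   # 8 illustrative objects, 3 variables
D = pdist(X, metric="euclidean")                   # condensed pairwise distance matrix

# 'single' = nearest neighbour, 'complete' = furthest neighbour, 'average' = mean
Z = linkage(D, method="complete")

# Cutting the dendrogram at a user-selected threshold distance defines the groups
groups = fcluster(Z, t=2.5, criterion="distance")
print(groups)
```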
The application of unsupervised pattern recognition methods should be undertaken with caution. Many factors will influence the results of the analysis, including variable scaling, the metric used, the similarity measure, the clustering criterion, the number of data points, etc. Dynamic clustering methods can be computationally efficient, particularly with large datasets, but unrealistic groupings can easily be produced, and analysis with several different cluster representative functions is recommended. If the number of samples is small, then describing some cluster representation function is meaningless, and in such situations hierarchical methods are more useful.

Cluster analysis is not a statistically based operation and should not be employed to prove the existence of groups. Rather, cluster analysis is best employed as part of the toolkit for exploratory data analysis. The evidence for particular groups and clusters, and the cause of any structure found, should be investigated by other techniques.


Supervised Pattern Recognition

Whereas cluster analysis neither needs nor assumes a priori information about cluster properties or pattern design, supervised pattern recognition uses a training set of known group members to define patterns and develop partition functions. The aim is to identify and quantify relationships between groups, and to assign unclassified objects to one of the groups. A wide range of modeling algorithms is available to perform this type of classification. Widely used techniques include the k-nearest neighbor (k-NN) method and linear discriminant analysis (LDA).

Nearest Neighbor Classification

For successful operation, the nearest neighbor rules for classification rely on a large number of correctly classified patterns being available. The basic principles are that samples lying close together in pattern space are likely to belong to the same class, or to have similar distributions of the classes. The first idea gives rise to the formulation of the single nearest neighbor rule, 1-NN, and the second provides for extension of the rule to k nearest neighbors (k-NN).

A test object to be classified is assigned to the class containing its nearest neighbor, as measured by some distance metric. The Euclidean distance is easily calculated and is often used. This 1-NN scheme can readily be extended to more (k) neighbors, with the class assignment decided by a majority vote, i.e., assignment to the class most represented among the k nearest classified objects. Common values for k are 3 or 5 (Figure 7).
The sample-based decision boundary between groups is defined by the relatively small number of samples belonging to the outer envelopes of the clusters. Samples deeply embedded within clusters do not contribute to defining the boundary and may be discarded for the classification of new objects. This concept can be exploited to increase the computational efficiency of nearest neighbor classification, since the distance from a new, unclassified object to every classified sample need not be calculated. A variety of so-called condensing algorithms is available; these aim to provide a subset of the complete training dataset such that 1-NN classification of any new pattern with the subset is identical to 1-NN classification with the complete set.

Since nearest neighbor methods are based on similarity measured by some distance metric, variable scaling and the units used to characterize the data can influence the results. Variables with the largest amount of scatter (greatest variance) will contribute most strongly to the Euclidean distance, and in practice it may be advisable to standardize variables before performing classification analysis.

Figure 7  With nearest neighbor classification a cluster is defined by the elements in the boundary layer, and an object is classified as belonging to the group containing its nearest neighbor. Object A will be assigned to group 1 rather than group 2.
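A minimal k-NN classifier along these lines can be sketched in Python (the training set, group labels, and test object below are hypothetical):

```python
import numpy as np
from collections import Counter

def knn_classify(x_new, X_train, y_train, k=3):
    """Assign x_new to the class most represented among its k nearest neighbours."""
    d = np.linalg.norm(X_train - x_new, axis=1)     # Euclidean distances to all training objects
    nearest = np.argsort(d)[:k]                     # indices of the k closest objects
    votes = Counter(y_train[i] for i in nearest)    # majority vote
    return votes.most_common(1)[0][0]

# Hypothetical training set: two groups in a two-variable pattern space
X_train = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],    # group 1
                    [4.0, 4.2], [3.8, 4.1], [4.1, 3.9]])   # group 2
y_train = np.array([1, 1, 1, 2, 2, 2])
print(knn_classify(np.array([1.1, 0.9]), X_train, y_train, k=3))   # -> 1
```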
Discriminant Function Analysis

In LDA, or discriminant function analysis, we seek to create new synthetic features (variables) that are linear combinations of the original variables and that best emphasize the differences between the known groups relative to the variable variances within the groups.

The process of performing LDA aims to derive and construct a boundary between the known classes of the training objects using statistical parameters. This boundary is derived using a discriminant function that, when applied to a test object, provides a value or score indicating the group to which the new object should be assigned.

If f(x_i, k) is some measure of the likelihood of object x_i belonging to group or class k, then the discriminant score D_i for assigning x_i to one of two classes is given by

D_i = f(x_i, k_1) - f(x_i, k_2)    [2]

which may be interpreted as saying that we classify test object x_i into class 1 if D_i is positive; otherwise x_i is considered as belonging to class 2.

The value of the discriminant score is calculated from a linear combination of the recorded values of the variables describing the objects, each suitably weighted to provide optimum discriminatory power. For two variables,

D_i = w_0 + w_1 x_{i,1} + w_2 x_{i,2}    [3]

The weights, or variable coefficients, used in calculating D (eqn [3]) are determined by

w = (\bar{x}_{(1)} - \bar{x}_{(2)})\, S^{-1}
w_0 = -\tfrac{1}{2} (\bar{x}_{(1)} - \bar{x}_{(2)})\, S^{-1} (\bar{x}_{(1)} + \bar{x}_{(2)})    [4]

This represents the ratio of the separation of the means of the two groups to the within-group variance, as given by the pooled covariance matrix, S. \bar{x}_{j(1)} and \bar{x}_{j(2)} are the vectors of mean values for variable j for groups (1) and (2), respectively, and are easily obtained from

\bar{x}_{j(k)} = \frac{1}{n_{(k)}} \sum_{i=1}^{n_{(k)}} x_{i,j(k)}    [5]

where n_{(k)} is the number of objects in group (k), and x_{i,j(k)} is the value for object i of variable j in group (k). The pooled covariance matrix, S, of the two training classes is given by

S = \frac{S_{(1)} + S_{(2)}}{n_{(1)} + n_{(2)} - 2}    [6]

where S_{(k)} represents the covariance matrix of group (k). The vector of weight coefficients can thus be calculated from the classified training data, and the discriminant score computed for each new, unclassified sample.

The discriminant function is linear; all the terms are added together to give a single number, the discriminant score. For a higher-dimensional pattern space the boundary is a hyperplane of m - 1 dimensions, where m is the number of variables. The partition boundary between two classes is defined at D_i = 0, and in the two-dimensional case it is given by

w_0 + w_1 x_{i,1} + w_2 x_{i,2} = 0    [7]

The classification boundary bisects the line between the centroids of the two clusters (Figure 8).

A useful result of performing LDA is the production of what is termed a discriminant plot, in which every data point (from the training set or the new test sample set) is projected onto the discriminant function displayed as a one-dimensional axis (Figure 9). The concept of a linear discriminant axis reduces the multidimensional classification problem to a single dimension, with the projection achieved so that discrimination between classes is preserved as well as possible.

Figure 8  Linear discriminant analysis provides a linear partition boundary between the two known groups, bisecting the line between the centroids of the two groups.

Figure 9  A discriminant plot projects the data onto a single axis (defined by the discriminant function).
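The two-class discriminant of eqns [3]-[6] can be sketched as follows in Python. Note that eqn [6] is used as printed here; many texts weight each group covariance matrix by n(k) - 1 before pooling, so treat the exact scaling as an assumption. The illustrative data are synthetic:

```python
import numpy as np

def lda_train(X1, X2):
    """Two-class linear discriminant weights (eqns [4]-[6])."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)        # group mean vectors (eqn [5])
    S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
    S = (S1 + S2) / (len(X1) + len(X2) - 2)          # pooled covariance as in eqn [6]
    Sinv = np.linalg.inv(S)
    w = Sinv @ (m1 - m2)                             # variable weights (eqn [4])
    w0 = -0.5 * (m1 - m2) @ Sinv @ (m1 + m2)         # constant term (eqn [4])
    return w0, w

def discriminant_score(x, w0, w):
    """D = w0 + w1*x1 + w2*x2 + ...; positive -> class 1, negative -> class 2."""
    return w0 + w @ x

# Illustrative two-variable training groups
rng = np.random.default_rng(2)
X1 = rng.normal([0.0, 0.0], 0.5, size=(20, 2))
X2 = rng.normal([3.0, 2.0], 0.5, size=(20, 2))
w0, w = lda_train(X1, X2)
print(discriminant_score(np.array([0.2, -0.1]), w0, w) > 0)   # True: assigned to group 1
```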
Bayes Classification

The application of discriminant analysis can be extended to include probabilities of class membership, and, assuming a multivariate normal distribution of the data, confidence intervals for class boundaries can be calculated. The Bayes rule for classification simply states that an object should be assigned to the group having the highest conditional probability. If there are two possible classes, then a sample is assigned to group 1 if

P(1|x) \geq P(2|x)    [8]

where P(1|x) is the conditional probability for group 1 given the pattern vector x. This conditional probability can be estimated from

P(1|x) = \frac{P(x|1)\,P(1)}{P(x|1)\,P(1) + P(x|2)\,P(2)}    [9]

where P(1) and P(2) are the probabilities of the sample belonging to group 1 or group 2 in the absence of analytical data, and P(x|1) expresses the conditional probability of the pattern x arising from a member of group 1.

Thus, a sample is assigned to group 1 if

P(x|1)\,P(1) > P(x|2)\,P(2)    [10]

and the partition boundary, where a sample has equal likelihood of belonging to group 1 or group 2, is given by

P(x|1)\,P(1) = P(x|2)\,P(2)    [11]

If the data are assumed to come from a population having a multivariate normal distribution, and it is furthermore assumed that the covariance matrix of each group is similar, then the conditional probability values can be calculated from the multidimensional normal distribution

P(x|k) = \frac{1}{2\pi |S|^{1/2}} \exp\left[ -0.5\,(x - \bar{x}_k)^{T} S^{-1} (x - \bar{x}_k) \right]    [12]

where S is the covariance matrix and \bar{x}_k the vector of variable means for group k. The term (x - \bar{x}_k)^{T} S^{-1} (x - \bar{x}_k) is a squared metric (the Mahalanobis distance), dm_k^2, representing the distance of each object from a group center, taking into account correlation within the data.

Using the vectors of means for each group, values for dm_1 and dm_2 can be calculated for every sample. A plot of dm_1 against dm_2 is referred to as a Coomans plot and displays the results of classification (Figure 10). On the same diagram, probability boundaries (confidence levels) can also be displayed. By substitution into eqn [12], the partition boundary is given by

P(1)\,\exp(-0.5\,dm_1^2) = P(2)\,\exp(-0.5\,dm_2^2)    [13]

and since P(2) = 1 - P(1), this equation can be rearranged to

dm_1 = \left[ dm_2^2 + 2 \ln\left( \frac{P(1)}{1 - P(1)} \right) \right]^{1/2}    [14]

Figure 10  A Coomans plot provides a visual representation of classification results; probability boundaries can also be displayed (the P(1) = 0.95 and P(2) = 0.95 levels are illustrated here).
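A short sketch of the Bayes assignment rule of eqn [10] under the equal-covariance normal model of eqn [12] is given below (illustrative function names; the shared normalizing constant of eqn [12] cancels between the two groups and is therefore omitted):

```python
import numpy as np

def mahalanobis_sq(x, mean, Sinv):
    """Squared Mahalanobis distance, dm^2, of x from a group centre (see eqn [12])."""
    d = x - mean
    return d @ Sinv @ d

def bayes_assign(x, mean1, mean2, Sinv, p1=0.5):
    """Assign x to group 1 if P(x|1)P(1) > P(x|2)P(2) (eqn [10]), assuming
    multivariate normal groups sharing the covariance matrix S."""
    dm1_sq = mahalanobis_sq(x, mean1, Sinv)
    dm2_sq = mahalanobis_sq(x, mean2, Sinv)
    g1 = p1 * np.exp(-0.5 * dm1_sq)
    g2 = (1.0 - p1) * np.exp(-0.5 * dm2_sq)
    return 1 if g1 > g2 else 2

# The square roots of dm1_sq and dm2_sq are the coordinates plotted in a Coomans plot.
```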


Selecting an appropriate value for P(1), dm_1 values can then be obtained for a range of dm_2 values, and similarly dm_2 values for a set of given dm_1 figures.

Classification and discriminant analysis algorithms are available in all multivariate statistical software packages. New or modified procedures are regularly being introduced, and the application of such techniques and methods in analytical science is growing.
See also: Chemometrics and Statistics: Statistical
Techniques; Expert Systems; Multicriteria Decision
Making. Computer Modeling. Nuclear Magnetic Resonance Spectroscopy Techniques: Multidimensional
Proton.

Further Reading
Adams MJ (2004) Chemometrics in Analytical Spectroscopy, 2nd edn. Cambridge: Royal Society of Chemistry.
Brereton RG (2003) Chemometrics: Data Analysis for the Laboratory and the Chemical Plant. Chichester: Wiley.
Coomans D and Massart DL (1992) Hard modelling in supervised pattern recognition. In: Brereton RG (ed.) Multivariate Pattern Recognition in Chemometrics, pp. 249-288. Amsterdam: Elsevier.
Devijver PA and Kittler J (1982) Pattern Recognition: A Statistical Approach. London: Prentice-Hall International.
Einax JW, Zwanziger HW, and Geiss S (1997) Chemometrics in Environmental Data Analysis. Heidelberg: VCH.
Everitt B (1980) Cluster Analysis, 2nd edn. London: Heinemann Educational.
Malinowski E (1991) Factor Analysis in Chemistry. New York: Wiley.
Nicholson JK, Connelly J, Lindon JC, and Holmes E (2002) Metabonomics: a platform for studying drug toxicity and gene function. Nature Reviews Drug Discovery 1: 153-161.
Varmuza K (1980) Pattern Recognition in Chemistry. New York: Wiley.

Multivariate Calibration Techniques


G Hanrahan, F Udeh, and D G Patil, California State
University, Los Angeles, CA, USA
© 2005, Elsevier Ltd. All Rights Reserved.

Introduction
Calibration is the process of relating the instrument response (y) of an analytical method to known concentrations of analytes (x) using model building and validation procedures. These measurements, along with the predetermined analyte levels, constitute a calibration set. This set is then used to develop a mathematical model that relates the amount of analyte in a sample to the measurements made by the instrument. In some cases, the construction of the model is simple owing to relationships such as Beer's law in the application of ultraviolet spectroscopy.
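For the simplest univariate case, such a straight-line calibration model can be fitted by least squares in a few lines of Python (the concentration and response values below are invented purely for illustration):

```python
import numpy as np

# Hypothetical calibration set: instrument response y at known analyte concentrations x
x = np.array([0.0, 1.0, 2.0, 4.0, 8.0])         # analyte concentration
y = np.array([0.01, 0.12, 0.24, 0.49, 0.98])    # instrument response (Beer's-law region)

slope, intercept = np.polyfit(x, y, 1)           # least-squares straight line y = a*x + b
y_unknown = 0.37
print((y_unknown - intercept) / slope)           # predicted concentration of an unknown
```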
Traditional univariate calibration techniques involve the use of a single instrumental measurement to determine a single analyte. In an ideal chemical measurement using high-precision instrumentation, an experimenter may obtain selective measurements linearly related to analyte concentration (Figure 1A). However, univariate techniques are very sensitive to the presence of outlier points in the data used to fit a particular model under normal experimental conditions. Often, even one or two outliers can seriously skew the results of a least squares analysis. Problems of selectivity and interferences (chemical and physical) also limit the effectiveness of univariate calibration methods, causing some degree of nonlinearity (Figure 1B). In addition, such calibration techniques are not well suited to the multitude of data collected from the sensitive, high-throughput instrumentation currently used in the analytical sciences. These datasets often contain large amounts of information, but in order to fully extract and correctly interpret this information, analytical methods incorporating a multivariate approach are needed.
In multivariate calibration, experimenters use many measured variables (x1, x2, ..., xk) simultaneously to quantify the target variable (the variable whose value is to be modeled and predicted by the others, i.e., the variable on the left of the equals sign in linear regression). In order to use multivariate techniques effectively, proper experimental design is essential. Experimental design allows the proper assessment of systematic variability (e.g., interferences) and helps in minimizing, for example, the effects of random noise. Experimental design is often limited by problems associated with the estimation of experimental
