Introduction
As instrumental chemical analysis techniques have become more sophisticated, with increasing levels of automation, the number of samples routinely analyzed has grown and the amount of data per sample has increased and can appear overwhelming. Multivariate classification techniques attempt to make sense of these data by identifying inherent patterns that may provide insight into the structure and form of the data and, hence, of the samples themselves. This general area of study is referred to as pattern recognition, a field rich in applications in analytical science. For example, the identification of the origin of goods and foodstuffs is an important task for the analyst. Given a limited set of possibilities, the problem at hand is one of classification: from a training set of samples of known origin, a classification scheme is developed that can classify unknown, test samples. In the biochemical field, nuclear magnetic resonance (NMR) spectroscopy of biofluids and cells provides a unique insight into changes in metabolism caused by drugs and toxins. Although biological NMR spectra are extremely complex, the essential diagnostic parameters are carried in the overall patterns of the spectra. Metabonomics employs pattern recognition methods to interrogate databases of proton NMR spectra. This approach allows a mathematical classification of toxicity based on disparate types of multidimensional metabolic data, giving new insights into the modes and biochemical mechanisms of toxicity. This work is of value in the prediction of toxicity in drug development studies.
The aims of pattern recognition are to determine summarizing structure within analytical data, using, for example, exploratory data analysis techniques and cluster analysis, and to identify such patterns (classification) according to their correspondence with previously characterized examples. Many numerical techniques are available to the analytical scientist wishing to interrogate their data, but all have the same starting point. The data are expressed in matrix form, in which each sample, or object, is described by a vector of measurements. This representation leads logically to the concept of a pattern space with as many dimensions as the number of measured variables (Figure 1).
Figure 1 Three objects, A, B, and C, displayed in the three-dimensional pattern space defined by the variables x1, x2, and x3. Objects A and B are closer to each other, and therefore considered more similar, than either is to object C.
Data Reduction
It is often the case, particularly in spectrochemical analysis, that the number of variables far exceeds the number of samples. This is not surprising given that a single infrared spectrum, for example, can comprise absorption measurements at several thousand wavelengths. Although many statistical techniques are available for identifying the major variables (features) responsible for defining the pattern space occupied by a sample, by far the most common technique employed in chemometric analysis of analytical data is principal components analysis (PCA). The method is an important tool for analysts in exploratory data analysis.
PCA involves rotating and transforming the original axes, representing the original variables, into new axes that lie along the directions of maximum variance of the data (Figure 2). These new axes are orthogonal, i.e., the new variables are uncorrelated. Because of the high correlation that frequently exists between analytically measured variables, relatively few principal components are usually sufficient to describe most of the variance in the data.
Figure 2 The first principal component, PC1, is a new axis representing the combination of original variables providing the greatest variance in the data.
The Euclidean distance between two objects, 1 and 2, each described by m variables, is given by

$$ d_{1,2} = \left[ \sum_{j=1}^{m} \left( x_{1,j} - x_{2,j} \right)^{2} \right]^{1/2} \qquad [1] $$
Figure 3 The original data, X, comprising n objects (samples) described by m variables, are converted to a dispersion (covariance or correlation) matrix, C. The eigenvalues, λ, and eigenvectors, L, are extracted from C. A reduced set of eigenvectors, L_red, is selected, and the original data are projected into the new, lower-dimensional pattern space, Y = X·L_red.
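To make the procedure in Figure 3 concrete, the sketch below carries out each step with NumPy on a small synthetic data matrix; the data, their sizes, and the choice of two retained components are invented for illustration.

```python
# A minimal sketch of the PCA procedure outlined in Figure 3.
# Names (X, C, L_red, Y) follow the figure; the data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # n = 20 objects, m = 5 variables
X = X - X.mean(axis=0)                # mean-center each variable

C = np.cov(X, rowvar=False)           # m x m covariance (dispersion) matrix

eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues and eigenvectors of C
order = np.argsort(eigvals)[::-1]     # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                 # retain the first k principal components
L_red = eigvecs[:, :k]                # reduced set of eigenvectors

Y = X @ L_red                         # project into the lower-dimensional space
print(Y.shape)                        # (20, 2): n objects, k scores each
print(eigvals / eigvals.sum())        # fraction of variance per component
```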
Figure 4 In multidimensional pattern space the Euclidean distance, d_{A,B}, between two objects A and B is provided by the square root of the sum of the squares of the differences between the values of the defining variables, Δx_j.
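As an illustration of eqn [1], and of the distance matrix that feeds the cluster analysis described next, the following sketch computes pairwise Euclidean distances for a few hypothetical objects:

```python
# Hypothetical illustration of eqn [1]: the Euclidean distance between
# two objects described by m variables, plus the full distance matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

x1 = np.array([1.0, 2.0, 0.5])           # object 1, m = 3 variables
x2 = np.array([0.0, 1.5, 1.0])           # object 2

d_12 = np.sqrt(np.sum((x1 - x2) ** 2))   # eqn [1]
print(d_12)

X = np.vstack([x1, x2, [2.0, 0.0, 0.3]])      # three objects
D = squareform(pdist(X, metric="euclidean"))  # symmetric n x n distance matrix
print(D)
```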
Hierarchical cluster analysis is noniterative and various implementation methods are commonly encountered in analytical science. Initially, each object in the
dataset is considered as a separate cluster. At each
subsequent stage of the algorithm the two clusters
that are most similar are merged to create a new
cluster. The algorithm terminates when all clusters
are combined. The number of clusters in the dataset
does not need to be known or provided a priori.
Using the original data matrix, a suitable matrix of similarity measures between objects is first constructed (e.g., the Euclidean distance matrix). From this similarity matrix, the most similar objects are combined to produce a new, grouped object, and the process is repeated until all objects have been included in a single cluster. The choice of an appropriate similarity metric and the manner in which objects are grouped (or clustered) give rise to many potential methods. Popular interpoint distances used for clustering are the nearest neighbor, furthest neighbor, and the mean (Figure 5).
The choice of measure may greatly influence the result of clustering. The nearest neighbor metric will link together quite distinct clusters if there exists a path of closely located points connecting the two clusters. The furthest neighbor algorithm does not suffer this problem, but it can be very sensitive to outliers and will not detect elongated clusters. The hierarchical structure of clusters provided by the algorithm is represented graphically by a dendrogram (Figure 6).
The dendrogram illustrates the cluster merging
sequence and the corresponding values of the similarity
measure employed. A threshold similarity value,
selected by the user, splits the data into the perceived
correct number of clusters. Generally, the threshold
level should be chosen so that intercluster distances are
considerably greater than the intracluster distances.
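A minimal sketch of this agglomerative procedure, using SciPy's hierarchical clustering routines: the 'single', 'complete', and 'average' linkage options correspond to the nearest neighbor, furthest neighbor, and mean interpoint distances of Figure 5, and the distance threshold plays the role of the dendrogram cut in Figure 6. The data and the threshold value are invented for illustration.

```python
# Agglomerative clustering of two loose synthetic groups with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.05, size=(5, 3)),   # group near the origin
               rng.normal(1, 0.05, size=(5, 3))])  # group near (1, 1, 1)

Z = linkage(X, method="single")   # nearest neighbor merging sequence

# Cut the dendrogram at a user-selected distance threshold (cf. Figure 6)
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)                     # cluster membership for each object

# dendrogram(Z) would draw the Figure 6 style tree (requires matplotlib)
```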
The application of unsupervised pattern recognition methods should be undertaken with caution. Many factors will influence the results of the analysis, including variable scaling, the metric used, the similarity measure, the clustering criterion, the number of data points, etc. Dynamic clustering methods can be computationally efficient, particularly with large datasets, but unrealistic groupings can easily be produced, and analysis with several different cluster representative functions is recommended. If the number of samples is small, then defining a cluster representation function is meaningless, and in such situations hierarchical methods are more useful.
Cluster analysis is not a statistically based operation and should not be employed to prove the existence of groups. Rather, cluster analysis is best employed as an exploratory tool, suggesting structure within the data that should be confirmed by other means.
Figure 5 The distance between object A and an existing cluster can be defined by its proximity to the cluster's nearest neighbor, d_NN, its center, d_M, or its furthest neighbor, d_FN.
Figure 6 A dendrogram provides a two-dimensional representation of the similarity between objects according to the distance between an object and a cluster. A threshold level, selected by the user, defines the number of distinct groups in the data.
Figure 7 With nearest neighbor classification a cluster is defined by the elements in the boundary layer, and an object is classified as belonging to the group containing its nearest neighbor. Object A will be assigned to group 1 rather than group 2.

In discriminant analysis each group is characterized by its centroid, the mean value of each variable j over the n_k objects belonging to group k,

$$ \bar{x}_{j}(k) = \frac{1}{n_k} \sum_{i=1}^{n_k} x_{i,j}(k) \qquad [2] $$

and an object x is scored by the linear discriminant function

$$ D = \mathbf{w}\cdot\mathbf{x} + w_0 \qquad [3] $$
The weights, or variable coefficients, used in calculating D (eqn [3]) are determined by

$$ \mathbf{w} = \left( \bar{\mathbf{x}}(1) - \bar{\mathbf{x}}(2) \right) S^{-1} \qquad [4] $$

$$ w_0 = -\tfrac{1}{2} \left( \bar{\mathbf{x}}(1) + \bar{\mathbf{x}}(2) \right) S^{-1} \left( \bar{\mathbf{x}}(1) - \bar{\mathbf{x}}(2) \right)' \qquad [5] $$

where $\bar{\mathbf{x}}(k)$ is the vector of variable means for group k, and S is the pooled dispersion matrix of the two groups of sizes n_1 and n_2,

$$ S = \frac{S_1 + S_2}{n_1 + n_2 - 2} \qquad [6] $$

An object is assigned to group 1 if D > 0, and to group 2 otherwise.
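The following sketch assembles eqns [2]-[6] for two small synthetic groups; the data are invented, and the variable names mirror the equations above.

```python
# A hedged sketch of eqns [2]-[6]: group means, pooled dispersion
# matrix, and the linear discriminant weights w and w0.
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal(0.0, 0.3, size=(15, 3))   # group 1: n1 objects, m variables
X2 = rng.normal(1.0, 0.3, size=(12, 3))   # group 2: n2 objects

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)   # eqn [2] for each group

# Within-group sums of squares and cross-products, pooled as in eqn [6]
S1 = (X1 - xbar1).T @ (X1 - xbar1)
S2 = (X2 - xbar2).T @ (X2 - xbar2)
S = (S1 + S2) / (len(X1) + len(X2) - 2)

S_inv = np.linalg.inv(S)
w = (xbar1 - xbar2) @ S_inv                            # eqn [4]
w0 = -0.5 * (xbar1 + xbar2) @ S_inv @ (xbar1 - xbar2)  # eqn [5]

x = np.array([0.2, 0.1, 0.0])   # an unknown object
D = w @ x + w0                  # eqn [3]
print("assign to group 1" if D > 0 else "assign to group 2")
```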
The application of discriminant analysis can be extended to include probabilities of class membership and, assuming a multivariate normal distribution of the data, confidence intervals for class boundaries can be calculated. The Bayes rule for classification simply states that an object should be assigned to the class for which it has the highest probability of membership. For two groups, an object x is assigned to group 1 if

$$ P(x|1)P(1) > P(x|2)P(2) \qquad [8] $$

where P(x|k) is the probability of observing x given membership of group k, and P(k) is the prior probability of membership of group k. The probability that an object belongs to group 1 is then given by

$$ P(1|x) = \frac{P(x|1)P(1)}{P(x|1)P(1) + P(x|2)P(2)} \qquad [9] $$

with the class-conditional probabilities obtained from the multivariate normal density,

$$ P(x|k) = \frac{1}{(2\pi)^{m/2} |S_k|^{1/2}} \exp\left[ -\tfrac{1}{2} \left( \mathbf{x} - \bar{\mathbf{x}}(k) \right) S_k^{-1} \left( \mathbf{x} - \bar{\mathbf{x}}(k) \right)' \right] $$

Selecting an appropriate value for P(1), d_{m1} values can be obtained for a range of d_{m2} values, and similarly d_{m2} values for a set of given d_{m1} figures, where d_{mk} is the distance from an object to the centroid of group k (the quantities plotted in the Coomans diagram of Figure 10).

Figure 8 Linear discriminant analysis provides a linear partition boundary between the two known groups, bisecting the line between the centroids of the two groups.

Figure 9 A discriminant plot projects the data onto a single axis (defined by the discriminant function).

Figure 10 A Coomans plot, with the distance to group 1 and the distance to group 2 as its axes, provides a visual representation of classification results; probability boundaries can also be displayed (the P(1) = 0.95 and P(2) = 0.95 levels are illustrated here).
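As a hedged illustration of eqns [8] and [9], the sketch below evaluates the class-conditional probabilities from multivariate normal densities (via SciPy) and forms the posterior probability of membership of group 1; the means, covariance, and prior probabilities are all invented.

```python
# Bayes classification rule (eqns [8] and [9]) for two groups with
# assumed multivariate normal class-conditional densities.
import numpy as np
from scipy.stats import multivariate_normal

mean1, mean2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
cov = np.array([[0.2, 0.0], [0.0, 0.2]])   # shared covariance for simplicity
P1, P2 = 0.5, 0.5                          # prior probabilities P(1), P(2)

x = np.array([0.3, 0.2])                   # object to classify

px1 = multivariate_normal.pdf(x, mean1, cov)   # P(x|1)
px2 = multivariate_normal.pdf(x, mean2, cov)   # P(x|2)

# eqn [9]: posterior probability of membership of group 1
P1_given_x = px1 * P1 / (px1 * P1 + px2 * P2)
print(P1_given_x)

# eqn [8]: assign to group 1 if P(x|1)P(1) > P(x|2)P(2)
print("group 1" if px1 * P1 > px2 * P2 else "group 2")
```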
Classification and discriminant analysis algorithms are available in all multivariate statistical software packages. New or modified procedures are regularly being introduced, and the application of such techniques and methods in analytical science is growing.
See also: Chemometrics and Statistics: Statistical
Techniques; Expert Systems; Multicriteria Decision
Making. Computer Modeling. Nuclear Magnetic Resonance Spectroscopy Techniques: Multidimensional
Proton.
Introduction
Calibration is the process of measuring the instrument response (y) of an analytical method against known concentrations of analytes (x) using model building and validation procedures. These measurements, together with the predetermined analyte levels, constitute a calibration set. This set is then used to develop a mathematical model that relates the amount of analyte to the measurements made by the instrument. In some cases, construction of the model is simple owing to relationships such as Beer's law in ultraviolet spectroscopy.
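As a simple illustration of such a model, the sketch below fits a straight-line calibration (cf. Beer's law) to an invented calibration set and inverts it to predict the concentration of an unknown sample:

```python
# Univariate calibration sketch: least-squares line through a
# calibration set, then inversion to predict an unknown concentration.
import numpy as np

conc = np.array([0.0, 1.0, 2.0, 4.0, 8.0])       # known analyte levels, x
resp = np.array([0.02, 0.21, 0.39, 0.82, 1.60])  # instrument response, y

slope, intercept = np.polyfit(conc, resp, 1)     # fit y = a + b*x

y_unknown = 0.55                                 # measured response of a sample
x_pred = (y_unknown - intercept) / slope         # predicted concentration
print(round(x_pred, 2))
```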
Traditional univariate calibration techniques involve the use of a single instrumental measurement to determine a single analyte. In an ideal chemical measurement using high-precision instrumentation, an experimenter may obtain selective measurements linearly related to analyte concentration (Figure 1A). However, under normal experimental conditions univariate techniques are very sensitive to the presence of outlier points in the data used to fit a particular model; often just one or two outliers can seriously skew the results of a least-squares analysis. The problems of selectivity and interferences (chemical and physical) also limit the effectiveness of univariate calibration methods, causing some degree of nonlinearity (Figure 1B). In addition, such calibration techniques are not well suited to the multitude of data collected from the sensitive, high-throughput instrumentation currently used in the analytical sciences. These datasets often contain large amounts of information, but in order to fully extract and correctly interpret this information, analytical methods incorporating a multivariate approach are needed.
In multivariate calibration, experimenters use many measured variables (x1, x2, ..., xk) simultaneously to quantify the target variable (the variable whose value is to be modeled and predicted by the others, i.e., the variable on the left of the equals sign in linear regression). In order to use multivariate techniques effectively, proper experimental design is essential. Experimental design allows the proper assessment of systematic variability (e.g., interferences) and helps in minimizing, for example, the effects of random noise. Experimental design is often limited by problems associated with the estimation of experimental