Review
Multivariate calibration
M. Forina, S. Lanteri, M. Casale
Department of Pharmaceutical and Food Chemistry and Technology,
University of Genova, Via Brigata Salerno 13, 16147 Genova, Italy
Available online 28 March 2007
Abstract
The bases of multivariate calibration are presented, with special attention to some points usually not considered or undervalued: the sampling design, the number of samples necessary to obtain a reliable regression model, the effect of noisy predictors, and the significance of the parameters used to evaluate the performance of the regression model.
© 2007 Elsevier B.V. All rights reserved.
Keywords: Multivariate calibration; Chemometrics; Regression
Contents

1. Introduction
2. Base knowledge
3. Data sets
4. Usual univariate analysis
5. Multicomponent analysis
6. Inverse calibration
7. Multivariate calibration
   7.1. Sufficient number of well selected samples
   7.2. Validation
   7.3. Predictive optimization
   7.4. Sufficient number of well selected predictors
   7.5. Pretreatments and fusion
   7.6. Residuals
   7.7. Outliers
   7.8. Clusters
   7.9. Updating and transfer of calibration
8. Conclusions
Abbreviations
Acknowledgment
Appendix A. Matrix algebra
   A.1. Scalar multiplication
   A.2. Inner multiplication
   A.3. Trace and determinant
   A.4. Identity matrix
   A.5. Singular matrix
0021-9673/$ - see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.chroma.2007.03.082
1. Introduction
In the last few decades multivariate calibration has become an important analytical tool in many fields of application, especially in food chemistry, pharmaceutical analysis, agriculture, the environment, and industrial and clinical chemistry. It is used for the determination both of chemical species and of physical quantities of interest in the chemical industry (e.g., octane number, viscosity), both in batch samples and in process control. It is also used for the prediction of sensory scores, of biological activity, and of toxicity.

The reason for the wide interest in multivariate calibration is that the analytical procedure is fast and cheap: not very accurate, but accurate enough for many real problems.

The chemical or physical quantity of interest (the response) is obtained as a function of many measured quantities (the predictors). Multivariate calibration uses non-specific predictors, generally physical information from spectra, especially near-infrared (NIR) spectra. However, it can be used with UV, visible, Raman, mid-infrared, fluorescence, NMR and mass spectra, with electrochemical predictors and with chemical predictors. Moreover, in the study of biological activity, the predictors are generally computed descriptors of the molecular structure.

The function that computes the response from the predictors is obtained by means of chemometric tools, able to extract a specific model from many non-specific predictors.
4. Usual univariate analysis

x = b y                                                       (2)

x = a + b y                                                   (3)

where the intercept takes into account factors such as the absorption of the solvent. Two standard samples, with different values of y, are necessary to compute the model parameters a and b. Then

y = b⁻¹(x − a) = b⁻¹x + c                                     (4)

Models (2) and (4) require that the signal be specific: due only to the analyte in the case of model (2), and to the analyte plus a constant factor in the case of model (4). So, the analysis requires physical and chemical treatments to eliminate interferents.

5. Multicomponent analysis

x_V = A_{VS} y_S                                              (5)

y_S = (A_{SV}^T A_{VS})⁻¹ A_{SV}^T x_V = B_{SV} x_V           (6)

provided that V ≥ S and that the matrix A_{SV}^T A_{VS} is invertible (at least S independent equations). In the case of mixtures of S chemical components only S − 1 equations are necessary, because the last equation is obtained from the closure condition:

Σ_{s=1}^{S} y_s = 1                                           (7)
In inverse calibration the response is regressed on the measured signals:

y_N = X_{NM} b_M                                              (10)

where M = V + 1 and X_{NM} is the augmented matrix of the predictors, with an added column of 1s for the intercept. The vector of the coefficients can be computed by means of:

b_M = (X_{MN}^T X_{NM})⁻¹ X_{MN}^T y_N                        (11)
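As an illustration, Eq. (11) can be reproduced in a few lines of NumPy. This is a minimal sketch with invented data (the matrix X and vector y below are hypothetical, not taken from the paper); in practice the explicit normal-equation inverse is replaced by a solver or by least squares:

    import numpy as np

    # Hypothetical small example: N = 5 samples, V = 2 predictors.
    X = np.array([[1.0, 2.0],
                  [2.0, 1.5],
                  [3.0, 4.0],
                  [4.0, 3.5],
                  [5.0, 6.0]])
    y = np.array([3.1, 4.9, 7.0, 9.2, 11.1])

    # Augment X with a column of ones so that b contains the intercept,
    # as in Eq. (10): y_N = X_NM b_M with M = V + 1.
    X_aug = np.column_stack([np.ones(len(X)), X])

    # Eq. (11): b_M = (X^T X)^(-1) X^T y; solve() and lstsq() are the
    # numerically safer equivalents of forming the inverse explicitly.
    b = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
    b_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    print(b, b_lstsq)  # identical up to rounding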
7. Multivariate calibration

The development of a multivariate calibration model requires:

1. a sufficient number of well selected samples;
2. validation;
3. a sufficient number of well selected predictors (and consequently pre-treatment of the predictors, choice of the regression technique and of the related techniques for the elimination of useless predictors).

7.1. Sufficient number of well selected samples

The number of samples must be sufficient to:

(a) evaluate SDEP with a reduced uncertainty;
(b) explore all the factors of variability in the chemical samples (especially chemical and physical matrix effects, but also instrumental factors);
(c) have both training and test sets representative of the above factors.

The samples must be selected to have a distribution close to the uniform distribution.

Nobody would select samples as in Fig. 1 for the usual univariate calibration, for two reasons: first, the two samples at the right have high leverage, and therefore a large influence on the regression model; second, the standard deviation of the residuals is a mean measure, heavily influenced by the samples at the left, as shown in Fig. 2. The presence of high-leverage points always indicates a bad distribution of the samples, and sometimes that the linear model is valid only in a limited range. However, in some cases the model has general validity and the leverage points stabilize the model.

Fig. 2. The absolute value of the residuals for the example in Fig. 1. (A) Standard deviation of the fitting error; (B) the same, but without the contribution of the two samples at the right.

In the case of multivariate calibration, the same people can use training sets with a very heavy lack of uniformity.

The samples available to develop a multivariate calibration model are:

(a) all the samples analyzed in the laboratory. For each sample both the spectrum and the response variables, measured with a reference technique, are available;
(b) a number of samples selected among many candidate samples. For each candidate sample the spectrum is available. The response variables will be determined only on the selected samples.

In both cases a good sampling design selects the samples for calibration with a uniform distribution.

In the first case the design can be performed both on the X matrix (the matrix of the spectra) and on the y vector or the Y matrix (the vector of the response or the matrix of the responses). In the second case the design can be performed only on the X matrix.

There are many techniques for uniform design. The Kennard–Stone design [7] can be used to obtain one or more sets of samples (e.g., a calibration and a test set); a minimal sketch of the algorithm is shown below.

Figs. 3 and 4 show the results with Kennard–Stone sampling and selection of two sets. In the case of Fig. 3 the design was performed on the two response variables, in the case of Fig. 4 on the first two principal components. The two sets can be used either one for training and the other as test set, or joined, using CV for validation. In both cases the sets are well balanced, without too large a number of samples with very similar characteristics. The Kennard–Stone designs should be compared with the design used by Ruisánchez et al. [8], where the responses are sorted and the ordered samples are then assigned to the training and the test sets (the first two to the training set, the third to the test set, and so on).
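The sketch announced above: a minimal Python implementation of the Kennard–Stone design under its usual maximin formulation (Euclidean distances; the starting pair is the two most distant objects). Function names and the example data are ours, not from the paper:

    import numpy as np

    def kennard_stone(X, n_select):
        """Kennard-Stone design: start from the two most distant objects,
        then repeatedly add the object whose minimum distance to the
        already-selected set is largest (maximin criterion)."""
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        i, j = np.unravel_index(np.argmax(d), d.shape)
        selected = [int(i), int(j)]
        while len(selected) < n_select:
            remaining = [k for k in range(len(X)) if k not in selected]
            # distance of each remaining object to its nearest selected object
            min_d = d[np.ix_(remaining, selected)].min(axis=1)
            selected.append(remaining[int(np.argmax(min_d))])
        return np.array(selected)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))        # e.g., scores on the first two PCs
    train = kennard_stone(X, 60)         # 60 objects for the training set
    test = np.setdiff1d(np.arange(100), train)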
7.2. Validation

The regression model will be used to estimate the value of the response in samples where its value is unknown. Consequently, it is necessary to have a measure of the error of this estimate, the prediction error.

The procedure used to evaluate the prediction error, i.e., the prediction power of the model, is known as validation:

(a) single test set validation divides the samples available for the calibration into two subsets, the training set and the evaluation or test set. The regression model is developed with the objects in the training set. The prediction error is evaluated on the objects of the test set, as the standard deviation of the error of prediction (SDEP_T, also indicated as the root-mean-square error of prediction, RMSEP), computed on the N_T samples in the test set:

SDEP_T = [ Σ_{i=1}^{N_T} (y_i − ŷ_i)² / N_T ]^{1/2}

(b) in cross-validation (CV) each object is predicted once, by a model computed without it (or without its cancellation group); the prediction error sum of squares and the corresponding SDEP are computed on all the N samples:

PRESS = Σ_{i=1}^{N} (y_i − ŷ_i)²

SDEP_CV = [ Σ_{i=1}^{N} (y_i − ŷ_i)² / N ]^{1/2}

(c) in Montecarlo or repeated test set validation, a large number of training and test sets are randomly created.

Fig. 5. The relationship between the CV percent explained variance and the standard deviation of the prediction error.
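As an illustration, SDEP (equivalently RMSEP) and PRESS are a few lines of NumPy; the function names are ours, not from the paper:

    import numpy as np

    def press(y, y_hat):
        # PRESS: sum of the squared errors of prediction
        return np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)

    def sdep(y, y_hat):
        # SDEP (= RMSEP): standard deviation of the error of prediction
        return np.sqrt(press(y, y_hat) / len(y))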
Table 1
Measured and predicted moisture for samples 114 and 115

Sample   Moisture measured   Moisture predicted
114      5.60                9.68
115      9.51                5.49
In the case of K as response I obtained 10% Q², not an excellent result but promising. It is necessary to state firmly that Q² < 90% indicates a poor or very poor regression model. Very small values of Q² are the result of casual, non-significant correlations: they do not indicate a relation between predictors and response.

Frequently, validation is performed both with a test set and with CV on the remaining objects. This is because CV is used to optimize the regression model (e.g., the number of latent variables in PLS), and the optimized model is then checked with the test set.

This procedure was used [8] in the analysis of data set Hay-1. The samples were ordered according to the value of moisture; then 228 samples were assigned to the training set and 77 to the test set. The result of prediction is shown in Fig. 6. There are two clear outliers, one in the training set and one in the test set.

Fig. 7 shows the results obtained without the test set, with CV validation only. The order of the objects is the original order.

Fig. 7. Data set Hay-1, response moisture: error of prediction for the objects in original order.
For univariate calibration, the confidence interval of the regression line is:

y = a + bx ± t_p s [ 1/N + (x − m_x)² / Σ_{i=1}^{N} (x_i − m_x)² ]^{1/2}      (14)
The uncertainty of SDEP due to the limited number of degrees of freedom ν can be evaluated from the chi-square distribution:

SDEP √(ν/χ²_{97.5}) < σ < SDEP √(ν/χ²_{2.5})                                  (15)

This means that with ν = 20 the uncertainty is about 35%, and with ν = 50 about 20%.

Eq. (15) provides a measure of the uncertainty of SDEP due to the samples. We will indicate this uncertainty of SDEP, the 95% range of Eq. (15), as R_SDEP^Samples. In the case of data set Soy-1 and response Moisture, R_SDEP^Samples is about 0.4, which is very large.
We apply Eq. (15) under some hypotheses: that the N samples are randomly drawn from the infinite population of all possible samples, and that the residuals have a normal distribution. In practice, these hypotheses are rarely respected. However, Eq. (15) supplies a rough estimate of the uncertainty of SDEP.
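A rough sketch of Eq. (15) with SciPy (the function name and example values are ours); it reproduces the quoted uncertainties of about 35% for ν = 20 and about 20% for ν = 50:

    import numpy as np
    from scipy.stats import chi2

    def sdep_interval(sdep_value, dof, level=0.95):
        """95% interval for the sigma behind an observed SDEP, Eq. (15):
        SDEP*sqrt(dof/chi2_97.5) < sigma < SDEP*sqrt(dof/chi2_2.5)."""
        alpha = 1.0 - level
        lo = sdep_value * np.sqrt(dof / chi2.ppf(1 - alpha / 2, dof))
        hi = sdep_value * np.sqrt(dof / chi2.ppf(alpha / 2, dof))
        return lo, hi

    print(sdep_interval(1.0, 20))  # about (0.76, 1.44): ~35% uncertainty
    print(sdep_interval(1.0, 50))  # about (0.84, 1.25): ~20% uncertainty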
Single test set validation cannot provide an estimate of the SDEP uncertainty. Instead, CV validation and Montecarlo validation can provide this information; e.g., in the case of CV validation each cancellation group contributes differently to SDEP. For each cancellation group g an estimate of SDEP can be obtained as:

SDEP_CV^g = [ Σ_{i=1}^{N_g} (y_i − ŷ_i)² / N_g ]^{1/2}

From the mean of these estimates, SDEP_CV^mean, and from its standard deviation, an estimate of R_SDEP^Samples can be obtained, provided that the number of objects in the cancellation groups is not very small. In the case of data set Soy-1 and response Moisture, with five cancellation groups (12 objects in each group), we obtained a standard deviation of SDEP_CV^mean of about 0.1, in good agreement with the estimate of R_SDEP^Samples from Eq. (15).
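A minimal sketch of this computation (function name and variable layout are ours): one SDEP estimate per cancellation group, then the standard deviation of their mean:

    import numpy as np

    def sdep_by_group(y, y_hat, groups):
        """One SDEP estimate per CV cancellation group; their mean and the
        standard deviation of the mean give a rough uncertainty of SDEP."""
        estimates = []
        for g in np.unique(groups):
            m = groups == g
            estimates.append(np.sqrt(np.mean((y[m] - y_hat[m]) ** 2)))
        estimates = np.array(estimates)
        sd_mean = estimates.std(ddof=1) / np.sqrt(len(estimates))
        return estimates, estimates.mean(), sd_mean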
Unfortunately, commercial software, in which validation is performed with a single test set or with CV, never supplies this information.
A second source of uncertainty (generally less important) is that due to the validation procedure and, in the case of CV validation, also to the order of the objects. We will indicate this uncertainty as R_SDEP^Validation, the range of SDEP values obtained with different validation procedures.

Table 2 shows, in the case of data set Soy-1 and response Moisture, the results obtained with two pre-treatments of the data and different numbers of CV groups; R_SDEP^Validation is about 0.15. Table 3 shows that the order of the objects also has an important influence on the value of SDEP, with R_SDEP^Validation about 0.11.

In spite of this uncertainty, the effects of the pre-treatments, of the regression technique and of the selection of predictors can be evaluated.

The two pre-treatments in Table 2 can be compared row by row (because in each validation scheme the prediction is performed on the same objects) by means of a statistical test for matched pairs: the Student t-test on the differences or the Wilcoxon matched-pairs test. Both conclude that in this case column centering is more efficient than column autoscaling; a minimal sketch of this comparison follows Table 2.
Table 2
Data set Soy-1, response moisture: difference test to compare centering and autoscaling

Validation groups    SDEP centering   SDEP autoscaling   Difference
2                    1.104            1.143               0.039
3                    0.992            1.026               0.034
4                    0.958            0.954              −0.004
5                    1.097            1.139               0.042
7                    1.040            1.061               0.021
10                   1.060            1.093               0.033
20                   1.023            1.036               0.013
30                   1.047            1.059               0.012
60 (leave-one-out)   1.034            1.044               0.010

Mean                 1.039            1.062               0.022
SD                   0.046            0.058               0.016
SD of mean           0.015            0.020               0.0052
R_SDEP^Validation    0.146            0.189
Student t                                                 4.3
Wilcoxon W                                                1
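The matched-pair comparison announced above can be reproduced with SciPy; the arrays below are the two SDEP columns of Table 2, and the t value of about 4.3 and the Wilcoxon W = 1 of the table are recovered:

    import numpy as np
    from scipy.stats import ttest_rel, wilcoxon

    # SDEP values from Table 2 (one row per number of CV groups)
    centering = np.array([1.104, 0.992, 0.958, 1.097, 1.040,
                          1.060, 1.023, 1.047, 1.034])
    autoscaling = np.array([1.143, 1.026, 0.954, 1.139, 1.061,
                            1.093, 1.036, 1.059, 1.044])

    print(ttest_rel(autoscaling, centering))  # t ~ 4.3, as in Table 2
    print(wilcoxon(autoscaling, centering))   # W = 1: centering wins almost always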
Table 3
Data set Soy-1, response moisture: SDEP obtained with different orders of the objects

Order   SDEP
1       1.0967
2       0.9978
3       1.0433
4       1.0591
5       1.0272
6       1.0795
7       0.9871
8       1.0465
9       1.0463
10      1.1012

Mean                1.0485
SD                  0.0380
SD of mean          0.0120
R_SDEP^Validation   0.1141
Table 4
Effect of noisy predictors on PLS-SDEP

Predictors                                                          PLS-SDEP, centering   PLS-SDEP, autoscaling
Only the two useful predictors                                      0.0013                0.0013
Also 98 noisy predictors with the same range as the useful ones     0.4709                0.4983
Also 98 noisy predictors with range 1/2                             0.1474                0.4983
Also 98 noisy predictors with range 1/4                             0.0387                0.4983
Also 98 noisy predictors with range 1/10                            0.0061                0.4983
Fig. 8. Effect of noisy predictors, all the predictors used in PLS. Lines of equal
value of SDEP/sb are shown.
Fig. 9. Effect of noisy predictors. Predictors for PLS selected by means of SOLS.
Lines of equal value of SDEP/sb are shown.
Table 5
Data set Hay-2, response moisture: results with different pretreatments, five CV groups

Pretreatment                   SDEP
Original                       0.2403
SNV                            0.2145
Vast                           0.2467
1st derivative                 0.2491
2nd derivative                 0.2622
SNV + 1st derivative           0.2259
SNV + 2nd derivative           0.2501
Original, autoscaling          0.2444
SNV, autoscaling               0.2139
1st derivative, autoscaling    0.2469
2nd derivative, autoscaling    0.2673

Table 6
Data set Hay-2, response moisture: Original and SNV data with different numbers of CV groups

CV groups   SDEP Original   SDEP SNV   Difference
3           0.2329          0.2135     0.0194
5           0.2403          0.2145     0.0258
7           0.2420          0.2078     0.0342
10          0.2541          0.2106     0.0435
20          0.2533          0.2119     0.0414
LOO         0.2444          0.2050     0.0394

R_SDEP^Validation   0.0212   0.0095
Mean                0.2445   0.2105    0.0340
Fig. 11. Scheme of fusion combined with selection of predictors with I-PLS.
Response intervals, number of objects per interval, and SDEP per interval

From   To   Objects   SDEP   SDEP (60 objects, Kennard–Stone selection)
2      4    1         0.0579   0.0345
4      6    7         0.4964   0.3684
6      8    29        0.2095   0.2668
8      10   212       0.1879   0.2701
10     12   47        0.2401   0.3136
12     14   8         0.3004   0.2503
14     16   1         0.3281   0.2225
Fig. 14. Kalivas data: PC plot of SNV autoscaled data (left) and plot of the responses (right).
These procedures require the use of some calibration transfer samples, measured with the same instrument at different times, or with different instruments. In the case of updating, the transfer samples are not related to a specific response, and the main objective is the updating of the spectra. The transfer samples must be very stable, reproducible and representative, which means that their spectra must show variation of absorbance over the whole wavelength interval.

In the case of two or more instruments and of a specific response, slope-and-bias correction is a very simple procedure, to be tried before more complex procedures.

The instruments must have the same predictors. The regression model of the master is applied to the predictors of the slave (server) instrument, to obtain the vector of predicted responses ŷ_server^(model of master). The linear regression of the measured responses on the predicted responses gives the correction rule:

y_server^corrected = a + b ŷ_server^(model of master)

In the case of data sets Soy-1 and Soy-2, response Moisture and original data, the values in ŷ_server^(model of master) heavily underestimate the measured moisture, as shown in Fig. 15.
Fig. 15. Prediction of moisture in the master and in the server (bottom points)
instruments.
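A minimal sketch of the slope-and-bias correction just described (names and the commented usage are ours; the transfer-sample vectors are assumed to be available):

    import numpy as np

    def slope_and_bias_correction(y_pred_server, y_measured):
        """Fit the correction y_corrected = a + b * y_pred on a few
        transfer samples measured on the second instrument."""
        b, a = np.polyfit(y_pred_server, y_measured, 1)  # slope, intercept
        return lambda y_pred: a + b * y_pred

    # y_pred_transfer: master model applied to the second instrument's
    #                  spectra of the transfer samples (hypothetical)
    # y_ref_transfer:  reference values of the same samples (hypothetical)
    # correct = slope_and_bias_correction(y_pred_transfer, y_ref_transfer)
    # y_corrected = correct(y_pred_new)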
reconsider the entire problem, e.g., the selection of the predictors and the instrument used as source of the physical information.

(14) Finally, when the previous efforts have been unsatisfactory, try ANN.

A final suggestion: chemometrics is a chemical discipline, which means that Chemistry is much more important than metrics. An excess of chemometrics means too complex models, underevaluation of chance correlations, and difficulty in maintaining the model over time. A correct use of chemometrics and a pragmatic use of statistical tests (valid under some rarely verified hypotheses) can instead be very useful.
Abbreviations

Typical chemometrics acronyms are used extensively in the text. An acronym can have many meanings (e.g., we found 114 different definitions for PCA), so the acronyms used in the text are collected here.
ACE      alternating conditional expectations
AFA
ANN      artificial neural network
CV       cross-validation
DWT      discrete wavelet transform
EMSC     extended multiplicative signal correction
EMSCL
GA       genetic algorithm
GOLPE    generating optimal linear PLS estimations
IPLS or I-PLS   interval PLS
IPW      iterative predictor weighting
ISE      iterative stepwise elimination
MAXCOR
MLF      multi-layer feed-forward (network)
MLR      multiple linear regression
MSC      multiplicative scatter correction
MUT
NIR      near infrared
NIRS     near-infrared spectroscopy
OLS      ordinary least squares
OSC      orthogonal signal correction
PC       principal component
PCA      principal component analysis
PCR      principal component regression
PLS      partial least squares regression
PRESS    prediction error sum of squares
Q²       percent explained variance in prediction (CV)
R²       percent explained variance in fitting
R_SDEP^Validation   range of SDEP obtained with different validation procedures
R_SDEP^Samples      95% range of SDEP due to the samples (Eq. (15))
RMSECV   root-mean-square error of cross-validation
RR       ridge regression
SD_Reference   standard deviation of the reference method
SDEC     standard deviation of the error of calibration
SDEP     standard deviation of the error of prediction
SEP      standard error of prediction
SNV      standard normal variate
SOLS     stepwise ordinary least squares
UV       ultraviolet
UVE      uninformative variable elimination
WPT      wavelet packet transform
Acknowledgment
Study developed with funds PRIN 2006 (National Ministry
of University and Research, University of Genova).
Appendix A. Matrix algebra

For our purposes, a matrix is a rectangular array of real numbers. A matrix has a number of rows R and a number of columns C.

A matrix is written as A (underlined capital or boldface). We can also indicate the number of rows and columns as:

A(m × n): e.g., A(2 × 30) for a matrix of 2 rows and 30 columns;
_mA_n: e.g., _2A_30;
A_RC, only when R symbolizes the number of rows and C the number of columns.

A matrix with one column is a COLUMN VECTOR, or simply a VECTOR; a matrix with a single row is a ROW VECTOR. A (column) vector is written as x or x_R; a row vector as x^T or x_C^T.

A matrix with one row and one column is a SCALAR, written as a lowercase Latin letter, as x. Also the single element of a matrix is a scalar: the element in row r and column c of the matrix B is written b_rc.
A SQUARE MATRIX has the same number of rows and columns. This common number is the ORDER of the matrix.

A square matrix A is SYMMETRIC when a_rc = a_cr. This is an example of a square symmetric matrix:

A = | 23 19 12 |
    | 19 15  0 |
    | 12  0  7 |

A DIAGONAL matrix has non-zero elements only on the diagonal:

D = | 1 0 0 |
    | 0 2 0 |
    | 0 0 3 |
The TRANSPOSE B of a matrix A is the matrix formed by transposing the rows and the columns of A, i.e., the matrix for which b_ik = a_ki.
The element in row m and column p of the product C = AB (with A of size M × K and B of size K × P) is:

c_mp = Σ_{k=1}^{K} a_mk b_kp
A.3. Trace and determinant

The TRACE of a square matrix is the sum of the elements on the diagonal:

tr A = Σ_r a_rr

For the matrix

A = | 1 2 3 |
    | 5 2 7 |
    | 4 3 7 |

tr A = 1 + 2 + 7 = 10.

The determinant of a square matrix of order 3 is obtained by cofactor expansion:

|A| = a_11 (a_22 a_33 − a_23 a_32) − a_12 (a_21 a_33 − a_23 a_31) + a_13 (a_21 a_32 − a_22 a_31)

The determinant of a diagonal matrix is the product of the elements on the diagonal:

|D| = Π_r d_rr
A.4. Identity matrix

The IDENTITY matrix I is a diagonal matrix with all the diagonal elements equal to 1. A NULL vector is a vector with all elements equal to 0; a UNITY vector has all elements equal to 1.
A.5. Singular matrix

A square matrix is singular when its determinant is zero.

We can see each row of a matrix as a point in the space of the columns. In the matrix

A = | 1 2 3 |
    | 5 2 7 |
    | 4 3 7 |
    | 5 4 9 |

column 3 is such that a_i3 = a_i1 + a_i2. This is the equation of a plane through the origin, so the true dimensionality is 2.

Consider the matrix

A^T A = |  67  44 111 |
        |  44  33  77 |
        | 111  77 188 |

Its determinant is 18425 + 12100 − 30525 = 0: such a matrix is singular.

Really, the determinant measures the dispersion in the space whose dimensionality is the number of columns of the matrix. When the number of rows is less than the number of columns, the true dimensionality cannot exceed the number of rows.

The RANK of a matrix, rank(A), is the dimensionality of the subspace spanned by its rows. It is the maximum order of a non-singular square minor. In the case of the matrix A above, the minor

B = | 1 2 3 |
    | 5 2 7 |
    | 4 3 7 |

has determinant 1·(2·7 − 7·3) − 2·(5·7 − 7·4) + 3·(5·3 − 2·4) = −7 − 14 + 21 = 0; every minor of order 3 is singular (column 3 is always the sum of columns 1 and 2), while, e.g., the minor |1 2; 5 2| = −8 ≠ 0, so rank(A) = 2.
A.6. Matrix inversion

In ordinary algebra, y is the inverse of x when yx = 1, and the inversion is possible when x ≠ 0. The matrix inverse is the analogue in matrix algebra: the inverse A⁻¹ of a square matrix A verifies

A⁻¹ A = A A⁻¹ = I

and a matrix can be inverted when its determinant is ≠ 0.

The inverse of a square matrix of order 2 must verify:

A⁻¹ A = | b_11 b_12 | | a_11 a_12 |   | 1 0 |
        | b_21 b_22 | | a_21 a_22 | = | 0 1 |

where the b_rc are the four elements of A⁻¹, obtained from these four equations.
Multiplication of vectors

With a^T = (1 2 3 4 5) and b = 2a, so that b^T = (2 4 6 8 10):

a^T a = 1 + 4 + 9 + 16 + 25 = 55
b^T b = 220
a^T b = b^T a = 110

With X the 5 × 2 matrix whose columns are a and b:

X^T X = |  55 110 |
        | 110 220 |

The inverse of a variance–covariance matrix of order 2 is:

C = | s_1²       r s_1 s_2 |     C⁻¹ = |  s_2²/D       −r s_1 s_2/D |
    | r s_1 s_2  s_2²      |           | −r s_1 s_2/D   s_1²/D      |

with D = s_1² s_2² − r² s_1² s_2².

The PSEUDOINVERSE (Moore–Penrose) A⁺ of a matrix A verifies:

A A⁺ A = A
A⁺ A A⁺ = A⁺
(A A⁺)^T = A A⁺
(A⁺ A)^T = A⁺ A

If the inverse of (A^T A) exists, then:

A⁺_CR = (A^T A)⁻¹_CC A^T_CR
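These properties are easy to verify numerically; a short NumPy check on the singular matrix A of Section A.5 (matrix_rank and pinv are standard NumPy calls):

    import numpy as np

    A = np.array([[1, 2, 3],
                  [5, 2, 7],
                  [4, 3, 7],
                  [5, 4, 9]], dtype=float)   # column 3 = column 1 + column 2

    print(np.linalg.matrix_rank(A))           # 2: the columns are dependent
    print(round(np.linalg.det(A.T @ A), 6))   # 0: A^T A is singular

    A_plus = np.linalg.pinv(A)                # Moore-Penrose pseudoinverse
    print(np.allclose(A @ A_plus @ A, A))     # True: first Moore-Penrose condition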
An important case of application of matrix inversion is that of redundant equation systems, shown in this example. Consider the system of three equations in two unknowns:

3x + 2y = 1
2x + 4y = 3
5x − 8y = −8

The three subsystems of two equations have matrices B_{2,2}

| 3  2 |    | 3  2 |    | 2  4 |
| 2  4 |    | 5 −8 |    | 5 −8 |

with determinants 8, −34 and −36. In matrix notation the full system is:

B_{3,2} x_{2,1} = u_{3,1}

with:

B_{3,2} = | 3  2 |,   x_{1,2}^T = (x  y),   u_{1,3}^T = (1  3  −8)
          | 2  4 |
          | 5 −8 |

Then:

B_{2,3}^T B_{3,2} = |  38 −26 |
                    | −26  84 |

(B_{2,3}^T B_{3,2})⁻¹ = | 0.033 0.010 |
                        | 0.010 0.015 |

(B_{2,3}^T B_{3,2})⁻¹ B_{2,3}^T = | 0.121 0.108  0.084 |
                                  | 0.061 0.081 −0.069 |

x_{2,1} = (B_{2,3}^T B_{3,2})⁻¹ B_{2,3}^T u_{3,1} = | −0.229 |
                                                    |  0.858 |

A.10. Usual univariate regression

Usual univariate regression is an important example of application of matrix inversion:

Scalar notation:         Matrix notation:
y_1 = a + b x_1          y_{I,1} = X_{I,2} b_{2,1}
y_2 = a + b x_2
...

with

y_{I,1} = |  3.1 |,   X_{I,2} = | 1 1 |
          |  4.9 |              | 1 2 |
          |  7.0 |              | 1 3 |
          |  9.2 |              | 1 4 |
          | 11.1 |              | 1 5 |

X_{2,I}^T X_{I,2} = |  5 15 |,   (X_{2,I}^T X_{I,2})⁻¹ = |  1.1 −0.3 |
                    | 15 55 |                            | −0.3  0.1 |

(X_{2,I}^T X_{I,2})⁻¹ X_{2,I}^T = |  0.8  0.5 0.2 −0.1 −0.4 |
                                  | −0.2 −0.1 0.0  0.1  0.2 |

b_{2,1} = (X_{2,I}^T X_{I,2})⁻¹ X_{2,I}^T y_{I,1}:

intercept:   0.8·3.1 + 0.5·4.9 + 0.2·7.0 − 0.1·9.2 − 0.4·11.1 = 0.97
slope:      −0.2·3.1 − 0.1·4.9 + 0.0·7.0 + 0.1·9.2 + 0.2·11.1 = 2.03

b_{2,1} = | 0.97 |  (intercept)
          | 2.03 |  (slope)
A square matrix H of order 2 is ORTHOGONAL when H H^T = I. A rotation of the axes by an angle α is described by the orthogonal matrix:

H = | h_11 h_12 | = |  cos α  sin α |
    | h_21 h_22 |   | −sin α  cos α |

The product:

|  cos α  sin α | | cos α −sin α |   | cos²α + sin²α        0        |   | 1 0 |
| −sin α  cos α | | sin α  cos α | = |       0       sin²α + cos²α   | = | 0 1 |

The coordinates of an object in the rotated space are:

e_1 =  x cos α + y sin α
e_2 = −x sin α + y cos α
Appendix B. Scaling

Frequently, the variables are of different nature (e.g., concentrations, absorbances, temperatures) or have different units (e.g., molarity, ppm, %w/w).

Many chemometric techniques are very sensitive to the numerical size of the variables: e.g., when the same concentration is measured in millimolar instead of molar units, the variance increases 10⁶ times.

Variance measures variability, and variability is information. So, apparently, the quantity of information depends on the selected unit.

SCALING is used to eliminate this dependence on the unit and on the nature of the variable. Scaling can also eliminate some useless information from the original data matrix. So, we call "scaling techniques" a series of simple data pretreatments.

These pretreatments, in spite of their mathematical simplicity, are very important: a fundamental step in data analysis.

An important function of scaling is to give the same starting importance (as information content) to all the variables. Then, on the basis of previous knowledge, the variables can be weighted according to their recognized information content, relevant for the specific problem, or, inversely, according to their uncertainty (standard deviation).
Column pre-treatments: each variable is considered separately. The result depends on the objects used to compute the scaling parameters.

(1) Column centering:

z_iv = x_iv − x̄_v

eliminates systematic location differences among variables. Centered variables have the same mean (0).

(2) Column standardization:

z_iv = x_iv / s_v

eliminates scale effects (dispersion differences) among variables; it can increase differences in location. Standardized variables have variance 1.

(3) (Column) autoscaling, the general scaling technique for variables of different nature:

z_iv = (x_iv − x̄_v) / s_v

Row pre-treatments: each object (spectrum) is considered separately.

Row centering:

z_iv = x_iv − x̄_i,   with x̄_i = Σ_{v=1}^{V} x_iv / V

Row standardization (the standard normal variate, SNV, transform):

z_iv = (x_iv − x̄_i) / s_i

Detrending: each spectrum is regressed on the variable index v, and the trend is subtracted:

b = Σ_{v=1}^{V} (x_iv − x̄_i)(v − v̄) / Σ_{v=1}^{V} (v − v̄)²,   a = x̄_i − b v̄,   z_iv = x_iv − a − b v

Global centering:

Z_iv = X_iv − X̄

Global standardization:

z_iv = x_iv / s_t,   with s_t = [ Σ_{i=1}^{N} Σ_{v=1}^{V} (x_iv − x̄)² / (NV − 1) ]^{1/2}

Multiplicative scatter correction (MSC): each spectrum is regressed on the mean spectrum x̄_v:

Z_iv = (X_iv − a) / b,   with b = Σ_{v=1}^{V} (x_iv − x̄_i)(x̄_v − x̄) / Σ_{v=1}^{V} (x̄_v − x̄)²   and   a = x̄_i − b x̄

A variant also subtracts a logarithmic wavelength term:

z_iv = (x_iv − a − c log v) / b
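As an illustration, two of the row pre-treatments above, SNV and MSC, in a few lines of NumPy (function names are ours; MSC uses the mean spectrum as reference):

    import numpy as np

    def snv(X):
        """Standard normal variate: row autoscaling of each spectrum."""
        return (X - X.mean(axis=1, keepdims=True)) / \
               X.std(axis=1, ddof=1, keepdims=True)

    def msc(X):
        """Multiplicative scatter correction: regress each spectrum on
        the mean spectrum and apply (x - a) / b."""
        ref = X.mean(axis=0)
        out = np.empty_like(X)
        for i, x in enumerate(X):
            b, a = np.polyfit(ref, x, 1)   # slope and intercept
            out[i] = (x - a) / b
        return out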
Fig. B.1. (A) Smoothing parabola; (B) smoothing third-order polynomial. Window with 11 points.
Fig. B.2. Moving cubic spline through the first four original variables.
The non-normalised Haar filters compute, for each pair of signal elements (s_i, s_{i+1}) with i odd, one approximation and one detail:

a_{1,(i+1)/2} = (1/2) s_i + (1/2) s_{i+1}
d_{1,(i+1)/2} = (1/2) s_i − (1/2) s_{i+1}

so that the signal can be reconstructed as:

s_i = a_{1,(i+1)/2} + d_{1,(i+1)/2}
s_{i+1} = a_{1,(i+1)/2} − d_{1,(i+1)/2}

B.5. The bases

In the example of Fig. B.4, with 16 data, the first level computes eight approximations and eight details. In the second step the filters are applied to the eight approximations computed in the first step:

a_{2,(j+1)/2} = (1/2) a_{1,j} + (1/2) a_{1,j+1}
d_{2,(j+1)/2} = (1/2) a_{1,j} − (1/2) a_{1,j+1}
84
Table B.1
Mallat pyramid of Kaplan data set
32
10
20
38
37
28
38
34
18
24
18
23
24
28
34
21.000
25.000
29.625
25.937
29.000
34.250
22.250
3.6875
32.500
17.250
4.625
36.00
27.250
5.000
21.00
4.00
13.50
1.75
23.50
3.75
31.00
3.75
11.00
9.00
4.50
2.00
3.00
4.50
0.50
3.00
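The pyramid of Table B.1 can be reproduced with a few lines of Python implementing the non-normalised Haar filters of Section B.5 (the function name is ours):

    import numpy as np

    def haar_level(s):
        # Non-normalised Haar filters: pairwise means and half-differences
        s = np.asarray(s, dtype=float)
        return (s[0::2] + s[1::2]) / 2, (s[0::2] - s[1::2]) / 2

    signal = [32, 10, 20, 38, 37, 28, 38, 34, 18, 24, 18, 9, 23, 24, 28, 34]
    a = np.array(signal, dtype=float)
    while len(a) > 1:
        a, d = haar_level(a)
        print(a, d)   # reproduces the columns of Table B.1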
After the first level, the application of the filters also to the details produces an extended tree, the Wavelet Packet Transform (WPT), shown in Fig. B.7.

WPT produces information also on the frequency structure of the details, providing more complete information. Moreover, WPT offers a larger choice of possible bases: those of DWT and others. The number of possible bases is important because a base can be simplified by the elimination of very small approximations or details, so as to describe a signal, in the space of wavelet frequencies, with a reduced number of variables (compression).
B.7. Orthogonalization

The coefficients of the low-pass filter of the non-normalised Haar wavelets are:

h_1 = 0.5,   h_2 = 0.5

The normalised filters verify:

Σ_j h_j² = 1,   Σ_j g_j² = 1,   Σ_j h_j g_j = 0

For the Haar wavelets:

h_1 = 1/√2,   h_2 = 1/√2
g_1 = 1/√2,   g_2 = −1/√2

The low-pass coefficients of the Daubechies-4 wavelets are:

h_1 = (1 + √3)/(4√2),   h_2 = (3 + √3)/(4√2),   h_3 = (3 − √3)/(4√2),   h_4 = (1 − √3)/(4√2)

and the high-pass coefficients are:

g_1 = h_4,   g_2 = −h_3,   g_3 = h_2,   g_4 = −h_1
Each time the twin filters (low-pass, h, and high-pass, g) are applied to four consecutive elements of the signal s, from s_i to s_{i+3} with i odd, they produce one approximation and one detail.

With V (an integer power of 2) elements of signal, when the filter is applied starting from s_{V−1}, the second element is s_V, so that two more elements are necessary: they are s_1 and s_2. In other words, the signal is considered periodic (for this reason wavelets are generally applied after detrending).
B.9. The best base

One of the objectives of the wavelet user is compression, and the best base is selected taking this objective into account.

There are many criteria to select the compressed best base, among them one based on a cut-off value and one based on the Shannon entropy. The result is a base with many cancelled elements.

In the case of many signals (objects) it is necessary to have a common best base, i.e., the same new variables for all the objects. The choice of a common base can be made with:

(1) the signal variance spectrum: a vector of V elements, where each element is the variance of one variable, computed on the N samples; the wavelet transform and the best-base selection are performed on the signal variance spectrum; the wavelet decomposition is then performed on all the objects, and for each object the elements retained are those in the best base of the signal variance spectrum;

(2) the spectrum of the absolute correlation coefficients of each variable with the response; the selection for each object is then made as in point (1).
Appendix C. Principal component analysis

Principal component analysis (PCA) is a technique of orthogonal rotation, generally around the centroid of the data and after autoscaling of the variables.

In the orthogonal rotation the variances on the axes, and the covariances and correlation coefficients between the axes, change. The trace and the determinant of the variance–covariance matrix are instead constant: the total dispersion (trace) and the dispersion in the multidimensional space (determinant) are invariant under rotation.

Fig. C.1 shows an example of two centered variables with a positive correlation coefficient. After rotation by a suitable angle, in the rotated space (Fig. C.2) the new variables (the principal components) are:

(a) uncorrelated;
(b) ordered according to their variance (eigenvalue). The first component is the direction of maximum variance.

The variance–covariance matrix of the original variables is:

| 292.959 152.692 |
| 152.692 117.635 |
with

Trace: 410.594
Determinant: 11147.56

After the rotation, the variance–covariance matrix of the principal components is diagonal:

| 381.363  0      |
| 0        29.231 |

with the same trace and determinant.

Fig. C.2. Objects of Fig. C.1 in the space of the two principal components.

When variable 2 is divided by 10, the variance–covariance matrix becomes:

| 292.959 15.2692 |
| 15.2692 1.17635 |

with

Trace: 294.136
Determinant: 111.4756

and the variances of the principal components are 293.757 and 0.379.

Fig. C.4. Objects of Fig. C.1 (but with variable 2 modified) in the space of the two principal components.

After autoscaling, the correlation matrix is:

| 1      0.8225 |
| 0.8225 1      |

with

Trace: 2
Determinant: 0.3235

and the variances of the principal components are 1.8225 and 0.1175.

Fig. C.5. Objects of Fig. C.4 in the space of the two principal components, after autoscaling.

Fig. C.6. The elimination of the information related to the direction of maximum variability.
The only problem is that variables with a very small range are frequently noisy variables; e.g., in spectroscopy, small-range variables correspond to wavelengths where there is no absorption. Usually, chemists do not use these variables. In the case of spectra, where all the variables have the same measurement unit, the PCs can be computed without autoscaling.

In the case of V > 2 variables, PC rotation computes V PCs, all uncorrelated and orthogonal, ordered according to their decreasing eigenvalues.

Also with many variables, very frequently only the first components have a significant eigenvalue. The last components, with very small variance, are generally related to the noise. This is the case of spectra, with very large correlations among the variables. In the case of spectra of mixtures of C components there are only C important principal components; the others are surely noisy. The significant mathematical, abstract components correspond to the real chemical components.

There are many statistical techniques to detect the number of significant components. The old Kaiser rule indicates as significant the components with eigenvalue larger than the mean variance (the trace divided by the number of variables); too frequently this rule gives bad results. Double cross-validation [19] is the best technique for usual chemical problems.

The starting information (variability and correlations) is represented by the covariance matrix, with V variances and V(V − 1)/2 covariances: with one thousand variables there are 500500 elements. After PC rotation with 10 significant components, the information is reduced to 10 eigenvalues.

PCA can be used:

(a) to visualize a large amount of information (not necessarily but frequently useful information) in the plots of the first PCs;
(b) to compress data, from the V original variables to C significant components.
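A minimal NumPy sketch of PCA as an orthogonal rotation (simulated data; variable names are ours), verifying the invariance of trace and determinant discussed above:

    import numpy as np

    X = np.random.default_rng(1).normal(size=(50, 2)) @ \
        np.array([[4.0, 1.0], [1.0, 2.0]])
    Xc = X - X.mean(axis=0)                  # centering

    C = np.cov(Xc, rowvar=False)             # variance-covariance matrix
    eigval, L = np.linalg.eigh(C)            # loadings = orthogonal rotation
                                             # (eigh returns ascending eigenvalues)
    scores = Xc @ L                          # coordinates on the PCs

    # Trace and determinant are invariant under the rotation:
    print(np.trace(C), eigval.sum())                      # equal traces
    print(np.isclose(np.linalg.det(C), np.prod(eigval)))  # equal determinants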
In the usual terminology of chemometrics, the coordinates in the space of the PCs are called SCORES, and the coefficients of the orthogonal rotation LOADINGS:

X_NV = S_NV L_VV^T

where S is the matrix of scores and L is the matrix of loadings; the orthogonal rotation matrix is reported as the transpose of L, the orthogonal matrix used in the inverse rotation. After the elimination of the information related to the first component, the data are reconstructed from the remaining components:

x_iv = Σ_{a=2}^{V} s_ia l_av
In ordinary least squares (OLS) regression the coefficients are computed as:

b_M = (X_{MN}^T X_{NM})⁻¹ X_{MN}^T y_N
The main techniques for the elimination of useless predictors in OLS are:

- all-subsets OLS;
- stepwise OLS;
- selection by genetic algorithms;
- stepwise orthogonalization.

All-subsets OLS is based on the use of all the possible combinations of a suitable number H < V of predictors. It is very rarely used, because it generally requires a very long computing time.

Stepwise OLS (SOLS) is a popular and very useful technique:

(1) In each step, SOLS selects the predictor that most increases the explained variance. V univariate regression models y = a + b_v x_v are computed, and the predictor with the minimum variance of the residuals,

s_v² = Σ_{i=1}^{N} (y_i − ŷ_i)² / (N − 1),

is selected.
Fig. D.1. A small data set visualized in the space of two predictors and of the
response.
The leverage of object i in a model with A latent variables is:

h_i^A = Σ_{a=1}^{A} t_ia² / (t_a^T t_a)
The objects are projected (t_i = x_i^T w) on the first latent variable (Fig. D.4), and their scores t_1, t_2, ..., t_N are obtained. The regression through the origin of the response on the first latent variable gives the first PLS model, as a function of the scores t_i (Fig. D.5) or, by means of the equation t_i = x_i^T w, of the predictors (the so-called closed form of PLS).

The information on the first latent variable is then eliminated from both the predictors and the response (see Appendix C), and the second latent variable is computed.
Fig. D.5. PLS models the response as a function of the latent variable.
Fig. D.6. SDEP as a function of the number of latent variables (data set Hay-1,
response moisture, centered data, five CV groups).
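A sketch of the SDEP-versus-latent-variables curve of Fig. D.6 using scikit-learn (an assumption: the paper's own software is not scikit-learn, and the function and parameter names here are ours):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import cross_val_predict

    def sdep_vs_components(X, y, max_lv=15, cv=5):
        """SDEP from CV groups as a function of the number of latent
        variables; choose the minimum (or the first plateau)."""
        out = []
        for a in range(1, max_lv + 1):
            y_hat = cross_val_predict(PLSRegression(n_components=a),
                                      X, y, cv=cv)
            out.append(np.sqrt(np.mean((y - y_hat.ravel()) ** 2)))
        return np.array(out)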
Fig. D.7. Data set Soy, response moisture, centered data, five CV groups. Validation repeated with 10 different orders of the objects.
evaluated. It decreases continuously. A set of objects (the optimization set) is used to evaluate SDEP, and the learning stops when SDEP reaches a minimum or becomes about constant (Fig. E.2). A third set is frequently used to evaluate the true predictive performance of the regression model.

Because of the optimization and test sets, ANNs require many well selected objects. Because of the number of possibilities (compression, number of hidden layers and neurons, learning rate), the selection of the best net and the learning procedure can be very long when the number of predictors is large, so that frequently they are applied to the principal components. Also the elimination of useless predictors is a time-consuming task, performed by means of genetic algorithms [35]. ANNs do not provide much statistical information and must be considered black-box tools. To our knowledge, they have not been compared with tools such as non-linear PLS, stepwise quadratic or cubic regression, or Alternating Conditional Expectations regression [36], methods that can give good results in the case of a non-linear relationship between predictors and response.

ANN software can be found easily, and ANN calibration models, computed with a very large number of samples, are sometimes sold with instruments.
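As an illustration of the early-stopping scheme of Fig. E.2, a minimal scikit-learn sketch (the data are simulated and all names are ours; MLPRegressor reserves an internal optimization set when early_stopping=True):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # early_stopping=True sets aside an internal optimization set and stops
    # when the validation score no longer improves, as in Fig. E.2.
    net = MLPRegressor(hidden_layer_sizes=(5,), early_stopping=True,
                       validation_fraction=0.25, max_iter=5000,
                       random_state=0)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 15))       # e.g., 15 PCs of autoscaled spectra
    y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=100)
    net.fit(X, y)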
Counter-propagation networks consist of a two-dimensional arrangement of neurons (Fig. E.3). The neurons in the first layer contain a vector of V data (corresponding to the V predictors); those in the second layer contain a single value, corresponding to the response.
Fig. E.2. MLF applied to Kalivas data, optimization set with 25 objects. Predictors: 15 PCs of autoscaled data.
F.5. Kalivas

NIR spectra of 100 flour samples [38]. Wavelength range from 1100 to 2500 nm with a 2 nm interval (701 predictors). The responses were moisture and protein. The second derivative of the spectra, obtained by means of a third-order smoothing polynomial through eleven points, was used.
References
Fig. E.4. Contour map of the response % protein in the Grossberg layer (data
set Kalivas).