
Journal of Chromatography A, 1158 (2007) 61-93

Review

Multivariate calibration
M. Forina, S. Lanteri, M. Casale
Department of Pharmaceutical and Food Chemistry and Technology,
University of Genova, Via Brigata Salerno 13, 16147 Genova, Italy
Available online 28 March 2007

Abstract
The bases of multivariate calibration are presented, with special attention to some points usually not considered or under-evaluated, i.e., the sampling design, the number of samples necessary to obtain a reliable regression model, the effect of noisy predictors, and the significance of the parameters used to evaluate the prediction ability of the regression model.
© 2007 Elsevier B.V. All rights reserved.
Keywords: Multivariate calibration; Chemometrics; Regression

Contents

1. Introduction
2. Base knowledge
3. Data sets
4. Usual univariate analysis
5. Multicomponent analysis
6. Inverse calibration
7. Multivariate calibration
   7.1. Sufficient number of well selected samples
   7.2. Validation
   7.3. Predictive optimization
   7.4. Sufficient number of well selected predictors
   7.5. Pretreatments and fusion
   7.6. Residuals
   7.7. Outliers
   7.8. Clusters
   7.9. Updating and transfer of calibration
8. Conclusions
Abbreviations
Acknowledgment
Appendix A. Matrix algebra
   A.1. Scalar multiplication
   A.2. Inner multiplication
   A.3. Trace and determinant
   A.4. Identity matrix
   A.5. Singular matrix

Corresponding author. Tel.: +39 0103532630; fax: +39 0103532684.


E-mail address: forina@dictfa.unige.it (M. Forina).

0021-9673/$ - see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.chroma.2007.03.082



   A.6. Matrix inversion
   A.7. Moore-Penrose generalized matrix inverse
   A.8. The inverse of the covariance matrix
   A.9. Vector premultiplied by its transpose
   A.10. Usual univariate regression
   A.11. Matrix solution of the global system x2,1
   A.12. Orthogonal matrices
Appendix B. Scaling
   B.1. Scaling procedures
   B.2. Smoothing and derivatives
   B.3. Padding
   B.4. Compression
   B.5. The bases
   B.6. Wavelets packet transform
   B.7. Orthogonalization
   B.8. Daubechies wavelets
   B.9. The best base
Appendix C. Principal components analysis
Appendix D. Elimination of useless predictors and biased regression techniques
   D.1. Selection techniques
   D.2. Biased regression techniques
Appendix E. Artificial neural networks
Appendix F. Data sets
   F.1. Soy-1
   F.2. Soy-2
   F.3. Hay-1
   F.4. Hay-2
   F.5. Kalivas
References

1. Introduction
In the last few decades multivariate calibration has become an important analytical tool in many different fields of application, especially in food chemistry, pharmaceutical analysis, agriculture, the environment, and industrial and clinical chemistry. It is used for the determination both of chemical species and of physical quantities of interest in the chemical industry (e.g., octane number, viscosity), both on batch samples and in process control. It is also used for the prediction of sensory scores, of biological activity and of toxicity.
The reason for the large interest in multivariate calibration is that the analytical procedure is fast and cheap; it is not very accurate, but accurate enough for many real problems.
The chemical or physical quantity of interest (the response) is obtained as a function of many measured quantities (the predictors). Multivariate calibration uses non-specific predictors, generally physical information from spectra, especially near-infrared (NIR) spectra. However, it can be used with UV, visible, Raman, mid-infrared, fluorescence, NMR and mass spectra, with electrochemical predictors and with chemical predictors. Moreover, in the study of biological activity, the predictors are generally computed descriptors of the molecular structure.
The function that computes the response from the predictors is obtained by means of chemometric tools, able to extract a specific model from many non-specific predictors.


Multivariate calibration is usually applied to complex real matrices, where the separation procedures and chemical treatments necessary to apply the usual simple calibration (based on only one specific predictor) are expensive and time consuming. As in usual calibration, a number of standards is necessary; they are usually obtained by means of a reference technique.
Many books [1,2], reviews [3-5] and thousands of papers (about two thousand in the past two years alone) on multivariate calibration have appeared in almost all the journals of analytical chemistry, in journals with special interest in specific application fields or in a selected technique (e.g., Journal of Near Infrared Spectroscopy), and in the journals devoted to chemometrics. Many international conferences have been organized, such as the International Conference on Near Infrared Spectroscopy (the 13th ICNIRS will be in Umeå, Sweden, in June 2007), with focus on the applications, and the PLS International Symposium (PLS'07, the 5th International Symposium on PLS, will be in Oslo, Norway, in September 2007), with focus on the theoretical background of the regression techniques.
Computer programs of generally good quality, such as SIMCA (Umetrics, Kinnelon, NJ, USA), Unscrambler (Infometrix, Woodinville, WA, USA), PLS_Toolbox (Manson, WA, USA), Sirius (PRS, Bergen, Norway) and PARVUS [6], are available to apply chemometric techniques to the measured data. The commercial software can generally be used with very limited knowledge of chemometrics, which cannot always be considered a positive feature. It contains only the most important techniques, so that the refinement of the calibration model can be unsatisfactory. In many cases the software is part of the instrument, and the data treatment is almost automatic and blind.
Here we present the bases of multivariate calibration, without too many details on the chemometric techniques, with special attention to the requirements, the validation procedures, the most common sources of error, and the criteria used to evaluate the results, i.e., the accuracy of the multivariate calibration.
2. Base knowledge

Some base theory is collected in the appendices, in a form, we hope, simple enough also for readers without knowledge of chemometrics:

Appendix A: Elements of matrix algebra
Appendix B: Scaling, smoothing, derivatives, padding, compression
Appendix C: Principal component analysis
Appendix D: Elimination of useless predictors and biased regression techniques
Appendix E: Artificial neural networks

3. Data sets

The description of the data sets used as examples is reported in Appendix F.

4. Usual univariate analysis

The simplest model is

x = by   (1)

with the signal x (e.g., the absorbance measured at a selected wavelength) proportional to the concentration y of the analyte. In principle one standard sample, the pure analyte, is necessary to measure x and to compute the model parameter b (calibration). The model can be applied to find y in samples where it is unknown, by means of the inverse equation:

y = b^{-1} x   (2)

In practice, a more complex model is used:

x = a + by   (3)

where the intercept takes into account factors such as the absorption of the solvent. Two standard samples, with different values of y, are necessary to compute the model parameters a and b. Then

y = b^{-1}(x - a) = b^{-1}x + c   (4)

Models (2) and (4) require that the signal be specific, due only to the analyte in the case of model (2), and to the analyte plus a constant factor in the case of model (4). So, the analysis requires physical and chemical treatments to eliminate interferents.
The usual model (3) is generally applied with more than the two necessary samples and with the use of least squares regression, under the hypothesis that the concentration of the standards has no error. There are many cases where the error on this concentration is rather large while the error on the physical measurement is small. In these cases the inverse regression, of the response on the physical variable, can be performed. The corresponding model is

y = a' + b'x   (4a)

5. Multicomponent analysis

Multicomponent analysis was the first attempt to eliminate interference by means of mathematics.
Let S be the number of absorbing species in the system, V the number of wavelengths, x_v the absorbance measured at wavelength v, x_V the vector of the V absorbances, a_{vs} the molar absorptivity of species s at wavelength v, A_{VS} the corresponding matrix with V rows and S columns, y_s the molarity of species s, and y_S the corresponding vector. The matrix equation

x_V = A_{VS} y_S   (5)

can be solved to obtain the vector of the concentrations:

y_S = (A^T_{SV} A_{VS})^{-1} A^T_{SV} x_V = B_{SV} x_V   (6)

provided that V >= S and that the matrix A^T_{SV} A_{VS} is invertible (at least S independent equations). In the case of mixtures of S chemical components only S - 1 equations are necessary, because the last equation is obtained from the closure condition:

\sum_{s=1}^{S} y_s = 1

Multicomponent analysis eliminates the interferences by means of mathematics, but the knowledge of the molar absorptivities of all the species in the chemical system is necessary. To obtain this knowledge only S standard samples, the pure compounds, are necessary.
Matrix Eq. (6) corresponds to S separate equations, one for each chemical species. However, this separation is only formal, because we need the absorptivity matrix and therefore the pure standards of each compound.
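To make Eqs. (5) and (6) concrete, the following is a minimal NumPy sketch for a synthetic system; the absorptivity matrix and concentrations are invented numbers, not data from this work, and NumPy is simply our choice of tool.

import numpy as np

# Synthetic example: V = 4 wavelengths, S = 3 absorbing species.
# A[v, s] = molar absorptivity of species s at wavelength v (invented values).
A = np.array([[0.90, 0.10, 0.30],
              [0.40, 0.80, 0.20],
              [0.10, 0.50, 0.70],
              [0.05, 0.20, 0.90]])

y_true = np.array([0.010, 0.025, 0.015])   # molarities of the three species
x = A @ y_true                             # Eq. (5): x_V = A_VS y_S (noise-free)

# Eq. (6): y_S = (A^T A)^-1 A^T x_V, valid when V >= S and A^T A is invertible
B = np.linalg.inv(A.T @ A) @ A.T
y_est = B @ x
print(y_est)                               # recovers y_true, since the data are noise-free

In practice numpy.linalg.lstsq(A, x) is numerically preferable, but the explicit form above mirrors Eq. (6).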
6. Inverse calibration

The adjective "inverse" means that here the calibration is the inverse of the usual calibration. Actually, the usual direct calibration also requires an inversion to obtain Eqs. (2), (4) and (6), which are used to predict the response y.
The starting point is Eq. (7), formally equal to Eq. (6):

y_S = B_{SV} x_V   (7)

We can consider the case of only one response variable, so that:

y = x^T_V b_V   (8)


Is it possible to separate, really and not only formally, Eq. (7) into S equations and, generally, to consider only the analyte of interest, not the interferents?
We need standard samples.
A really separated Eq. (8) means that only the concentration of the response variable of interest must be known in the standard samples used for calibration.
How many standard samples, with known value of the response, are necessary? At least N = V samples, because we need V equations, i.e., the system:

y_N = X_{NV} b_V   (9)

where X_{NV} is the matrix of the predictors.


How many predictors? With only one predictor Eq. (9) coincides with Eq. (2): no possibility of eliminating interferents. So, we need a number of predictors sufficient to take into account, in the regression coefficients b, the effect of the interferents.
Model (9) can be augmented to take into account the effect of constant factors, with an intercept b_{V+1} with the same significance as c in Eq. (4):

y_N = X_{NV} b_V + b_{V+1} = X_{NM} b_M   (10)

where M = V + 1, and X_{NM} is the augmented matrix of the predictors, with an added column of ones.
The vector of the coefficients can be computed by means of:

b_M = (X^T_{MN} X_{NM})^{-1} X^T_{MN} y_N   (11)

Eq. (11) is the minimum least squares solution of system (10), i.e., the solution provided by ordinary least squares (OLS) regression (also known as multiple linear regression, MLR).
By substituting b_M from Eq. (11) into Eq. (10) we will not obtain the vector y_N of the responses, but their least squares estimate:

\hat{y}_N = X_{NM} (X^T_{MN} X_{NM})^{-1} X^T_{MN} y_N   (12)
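As an illustration of Eqs. (10)-(12) (a sketch on invented data, not the authors' software), the OLS coefficients and fitted responses can be computed directly with NumPy:

import numpy as np

rng = np.random.default_rng(0)
N, V = 12, 3                          # N standard samples, V predictors (arbitrary numbers)
X = rng.random((N, V))                # matrix of the predictors, X_NV
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.01 * rng.standard_normal(N)

# Augmented matrix X_NM (M = V + 1): a column of ones carries the intercept b_{V+1}
X_aug = np.hstack([X, np.ones((N, 1))])

b = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y   # Eq. (11): OLS / MLR solution
y_hat = X_aug @ b                                  # Eq. (12): least squares estimate of y_N

print(b)                                           # approximately [2.0, -1.0, 0.5, 3.0]
print(np.sqrt(np.mean((y - y_hat) ** 2)))          # standard deviation of the fitting error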

The inversion of the matrix X^T_{MN} X_{NM} (frequently called the information matrix) requires that the number of standard samples be at least equal to M. Moreover, the samples must be different, which means they must have different values of the vector of the predictors.
Inverse calibration is usually applied to spectral data, where M can be more than one thousand. So, the number of standard samples can be too large for practical purposes.
There are some possibilities:

(a) selection of a subset of predictors (the base technique remains OLS);
(b) compression (the nature of the predictors is modified);
(c) use of biased regression techniques (different from OLS);
(d) combination of selection, compression and biased techniques, or use of the selection procedures typical of biased techniques.

OLS is considered the unbiased regression technique. Roughly, the hypothesis is that the predictors are all necessary and relevant. A chemical mixture is a linear combination of its constituents, each weighted by its percentage. When we need to reconstruct a mixture perfectly we must use all the constituents, including the minor or trace constituents. In practice we can use only the relevant constituents. The elimination of trace constituents reduces the reconstruction cost and can give other advantages (e.g., elimination of unwanted contaminants): we introduce a bias, but the resulting mixture can be satisfactory and sometimes better.
In chemical calibration the response cannot be considered as a mixture of the predictors. Many predictors can be completely useless, so that their elimination always has a positive effect.
The matrix of the predictors is the result of experiments, and measurements are always characterized by a random error. If the predictors are all necessary and relevant, y_N is the true value of the vector of the responses, and we repeat H times the experimental determination of the matrix X_{NV} on the same samples, an unbiased technique has the property

lim_{H -> infinity} E(\hat{y}_N) = y_N   (13)

where E(\hat{y}_N) is the mean, over the H repetitions, of the estimate of the vector of computed responses.
However, y_N is not the true value of the vector of the responses: the responses have an error, generally the error of the reference technique. For these reasons (not all the predictors are necessary and relevant, y_N is not the true value) OLS cannot be considered a really unbiased technique.
Biased techniques introduce a small systematic error on the response but with a more important decrease of the random error, so that they can offer some advantage also when the matrix X^T_{MN} X_{NM} can be inverted.
7. Multivariate calibration
What we call Multivariate calibration is an inverse calibration:
(a) without the knowledge of the spectra of the responses and
of all the interferents;
(b) with the possibility to study separately each response;
(c) with a number V of well selected predictors sufficient to take
into account the effect of interferents;
(d) with a number N of well selected samples, sufficient to
explore the variability of the chemical systems on which
the regression model has to be applied; N must be sufficient
also to evaluate the predictive ability of the regression model
by means of the usual validation techniques;
(e) generally with the use of unbiased regression techniques.
The above definition of multivariate calibration introduces
all the critical points encountered during the development of a
calibration model:
1. sufficient number of well selected samples;
2. predictive power and validation;

3. sufficient number of well selected predictors (and, consequently, pre-treatment of the predictors, choice of the regression technique and of the related techniques of elimination of useless predictors).

7.1. Sufficient number of well selected samples

The number of samples must be sufficient to:

(a) evaluate SDEP with a reduced uncertainty;
(b) explore all the factors of variability in the chemical samples (especially chemical and physical matrix effects, but also instrumental factors);
(c) have both training and test sets representative of the above factors.

The samples must be selected to have a distribution close to the uniform distribution.
Nobody selects samples for the usual univariate calibration as shown in Fig. 1, for two reasons: first, the two samples on the right have high leverage, and therefore a large influence on the regression model; second, the standard deviation of the residuals is an average measure, heavily influenced by the samples on the left, as shown in Fig. 2. The presence of high-leverage points always indicates a bad distribution of the samples, and sometimes that the linear model is valid only in a limited range. However, in some cases the model has general validity and the leverage points stabilize the model.

Fig. 1. A (bad) example of usual univariate calibration.
Fig. 2. The absolute value of the residuals for the example in Fig. 1. (A) Standard deviation of the fitting error; (B) the same, without the contribution of the two samples at the right.

In the case of multivariate calibration, the same people can use training sets with a very heavy lack of uniformity.
The available samples to develop a multivariate calibration model are:

(a) all the samples analyzed in the laboratory. For each sample
the spectrum and the response variables measured with a
reference technique are available;
(b) a number of samples selected among many candidate samples. For each candidate sample the spectrum is available.
The response variables will be determined only on the
selected samples.
In both cases a good sampling design selects the samples for
calibration with a uniform distribution.
In the first case the design can be performed both on the X
matrix (the matrix of the spectra) and on the y vector or the Y
matrix (the vector of the response or the matrix of the responses).
In the second case the design can be performed only on the X
matrix.
There are many techniques for uniform design. The Kennard-Stone design [7] can be used to obtain one or more sets of samples (e.g., a calibration and a test set).
Figs. 3 and 4 show the results with Kennard-Stone sampling and selection of two sets. In the case of Fig. 3 the design was performed on the two response variables, in the case of Fig. 4 on the first two principal components. The two sets can be used either with one for training and the second as test set, or joined, using CV for validation. In both cases the sets are well balanced, without too large a number of samples with very similar characteristics. The Kennard-Stone designs should be compared with the design used by Ruisanchez et al. [8], where the responses are sorted and the ordered samples are assigned to the training and the test sets (the first two to the training set, the third to the test set, and so on).
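A minimal sketch of the Kennard-Stone selection, under our reading of [7] (start from the two most distant objects, then repeatedly add the candidate whose distance to the nearest already selected object is largest); X can be the matrix of responses or of principal-component scores.

import numpy as np

def kennard_stone(X, n_select):
    """Return the indices of n_select rows of X chosen by the Kennard-Stone criterion."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    selected = list(np.unravel_index(np.argmax(dist), dist.shape)) # the two most distant objects
    while len(selected) < n_select:
        remaining = [i for i in range(len(X)) if i not in selected]
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)      # distance to nearest selected
        selected.append(remaining[int(np.argmax(d_min))])          # add the farthest candidate
    return selected

With, e.g., 60 selected objects used for calibration, the remaining objects can serve as a test set, as in Figs. 3 and 4.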

Fig. 3. Data set Hay, selection of 60 samples, Kennard-Stone Duplex on the response variables.

Fig. 4. Data set Hay, selection of 60 samples, Kennard-Stone Duplex on the first two components of the predictors.

7.2. Validation

The regression model will be used to estimate the value of the response in samples where its value is unknown. Consequently, it is necessary to have a measure of the error of this estimate, the prediction error.
The procedure used to evaluate the prediction error, i.e., the predictive power of the model, is known as validation:

(a) single test set validation divides the samples available for the calibration into two subsets, the training set and the evaluation or test set. The regression model is developed with the objects in the training set. The prediction error is evaluated on the objects of the test set, as the standard deviation of the error of prediction on the test set (SDEP_T, also indicated as root-mean-square error of prediction, RMSEP), computed on the N_T samples in the test set:

SDEP_T = \sqrt{ \sum_{i=1}^{N_T} (y_i - \hat{y}_i)^2 / N_T }

(b) in cross-validation (CV) a number G of cancellation groups or segments is selected. The N objects are assigned to one of the cancellation groups by means of a systematic procedure, i.e., object 1 is assigned to the first group, object 2 to the second, ..., object G to group G, object G + 1 to group 1, and so on. The regression model is computed G times, and each time the objects of the cancellation group constitute the test set. The response of each object is predicted one time. The standard deviation of the prediction error is obtained as SDEP_CV, also known as standard error of prediction (SEP) or root-mean-square error of cross-validation (RMSECV):

SDEP_CV = \sqrt{ \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 / N }

(c) in Monte Carlo or repeated test set validation, a large number of training and test sets are randomly created, generally with a preselected probability of assignment. The response of each sample is predicted many times, by different models obtained with the different training sets. The standard error of prediction is computed on the total number T of predictions:

SDEP_MC = \sqrt{ \sum_{i=1}^{T} (y_i - \hat{y}_i)^2 / T }

In the case of cross-validation the quantity predictive residual error sum of squares (PRESS) is also frequently used:

PRESS = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

To evaluate the prediction ability, people very frequently use the cross-validated percent explained variance of the response:

R^2_{CV} (or Q^2) = 100 ( 1 - SDEP^2 / s_b^2 )

where s_b is the before-regression standard deviation of the response. This should be computed as:

s_b^2 = \sum_{i=1}^{N} (y_i - \bar{y}_G)^2 / N

where \bar{y}_G is the mean of the response computed over the objects in the training set of the Gth CV cancellation group. However, the mean of all the N objects used for calibration is frequently used instead.
Q^2 is a deceptive parameter. The relationship between Q^2 and SDEP/s_b is shown in Fig. 5. A value of 80% for Q^2 corresponds to SDEP/s_b of about 0.5, which means that the result of the regression is very poor. Very frequently regression models have Q^2 < 50%. We remember someone at a conference asserting: "in the case of K as response I obtained 10% Q^2, not an excellent result but promising". It is necessary to state firmly that Q^2 < 90% indicates a poor or very poor regression model. Very small values of Q^2 are the result of casual non-significant correlations: they do not indicate a relation between predictors and response.

Fig. 5. The relationship between the CV percent explained variance and the standard deviation of the prediction error.
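The warning follows directly from the definition given above; rearranging it (our own short derivation) makes the curve of Fig. 5 explicit:

Q^2 = 100 ( 1 - SDEP^2 / s_b^2 )   <=>   SDEP / s_b = \sqrt{ 1 - Q^2 / 100 }

so Q^2 = 80% gives SDEP/s_b about 0.45, Q^2 = 50% gives SDEP/s_b about 0.71, and Q^2 = 10% gives SDEP/s_b about 0.95.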
Frequently, validation is performed both with a test set and with CV on the remaining objects. This is because CV validation is used to optimize the regression model (e.g., the number of latent variables in PLS), and the optimized model is then checked with the test set.
This procedure was used [8] in the analysis of data set Hay-1. The samples were ordered according to the value of moisture; then 228 samples were assigned to the training set and 77 to the test set. The result of prediction is shown in Fig. 6. There are two clear outliers, one in the training set and one in the test set.

Fig. 6. Data set Hay-1, response moisture: error of prediction for the objects divided in training (left) and test set (right). Objects in the two sets are ordered according to the moisture percent.

Fig. 7 shows the results obtained without the test set, only with CV validation. The order of the objects is the original order. The two outliers were analyzed consecutively, in the same work session. The details are shown in Table 1.

Table 1. Measured and predicted moisture for samples 114 and 115.

Sample   Moisture measured   Moisture predicted
114      5.60                9.68
115      9.51                5.49

The interpretation is straightforward. The subdivision between training and test sets caused the loss of information that is very important in the interpretation of the results. Data analysis must always preserve the contact with the laboratory, with the chemistry.

Fig. 7. Data set Hay-1, response moisture: error of prediction for the objects in the original order.

SDEP is used to:

(1) evaluate the predictive performance of the model;
(2) compare different models, e.g., those obtained with different regression techniques or, for the same technique, with different settings;
(3) compare the results of multivariate calibration with those obtained with the reference technique, characterized by its standard deviation SD_Reference.
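As a computational illustration of the validation quantities defined above, a minimal sketch using scikit-learn (our choice of tools; X and y stand for any predictor matrix and response vector):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

def sdep_cv(X, y, n_latent=5, n_groups=5):
    """SDEP and Q2 from cross-validation with G cancellation groups."""
    y = np.asarray(y, dtype=float)
    model = PLSRegression(n_components=n_latent)
    # contiguous cancellation groups; the paper's systematic (interleaved) assignment
    # differs only in how objects are distributed among the groups
    cv = KFold(n_splits=n_groups, shuffle=False)
    y_hat = cross_val_predict(model, X, y, cv=cv).ravel()
    press = np.sum((y - y_hat) ** 2)                 # PRESS
    sdep = np.sqrt(press / len(y))                   # SDEP_CV
    q2 = 100.0 * (1.0 - sdep ** 2 / np.var(y))       # Q2, with s_b^2 taken over all N objects
    return sdep, q2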
In the case of the usual linear regression, slowly, in the past century, analytical chemists learned to use suitable statistical tools to evaluate the predictive ability of the regression model. It is given by a confidence interval of the unknown that contains the variance of the fitting error, the leverage and the Student t critical value. Student statistics incorporate the uncertainty on the variable (normally distributed) and the uncertainty on the variance (with a chi-squared distribution).
With reference to model (4a), it is:

y = a' + b'x ± t_p s [ 1/N + (x - m_x)^2 / \sum_{i=1}^{N} (x_i - m_x)^2 ]^{1/2}   (14)

where x is the value of the physical quantity from which the value y of the response is predicted, x_i is one of the N points used in calibration, m_x is the mean of the N values x_i, s is the standard deviation of the fitting error, and t_p is the critical value of the Student distribution at probability level p. The term between brackets is the leverage.
Instead, at present the evaluation of the prediction error in multivariate calibration is rather rough, and both the uncertainty on the variance and the leverage are not considered.
SDEP is an experimental measure. Its reliability depends on the total number of objects, on the number of objects in the test set, N_T, on the number of cross-validation groups and, in the case G < N, on the order of the objects (when G = N the validation procedure is known as leave-one-out, and SDEP does not depend on the order of the objects).
The quantity ν SDEP^2 / σ^2, where σ^2 is the unknown true value, is distributed as a chi-squared variable with ν degrees of freedom; ν is N_T in the case of a single test set, N in the case of CV or Monte Carlo validation.


The 95% confidence interval of σ is obtained from the critical values of the chi-squared distribution as:

SDEP \sqrt{ ν / χ^2_{97.5} } < σ < SDEP \sqrt{ ν / χ^2_{2.5} }   (15)

which means that with ν = 20 the uncertainty is about 35%, and with ν = 50 about 20%.
Eq. (15) provides a measure of the uncertainty of SDEP due to the samples. We will indicate this uncertainty of SDEP, the 95% range of Eq. (15), as R^{Samples}_{SDEP}. In the case of data set Soy-1 and response Moisture, R^{Samples}_{SDEP} is about 0.4, very large.
We apply Eq. (15) under some hypotheses: that the N samples are randomly drawn from the infinite population of all possible samples, and that the residuals have a normal distribution. In practice, these hypotheses are rarely respected. However, Eq. (15) supplies a rough estimate of the uncertainty of SDEP.
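Eq. (15) is easy to evaluate numerically; a minimal SciPy sketch (an illustration, not part of any commercial package):

import numpy as np
from scipy.stats import chi2

def sdep_confidence_interval(sdep, nu, level=0.95):
    """Interval for the true sigma given an SDEP estimated with nu degrees of freedom (Eq. 15)."""
    alpha = 1.0 - level
    upper_crit = chi2.ppf(1.0 - alpha / 2.0, nu)   # chi-squared critical value at 97.5%
    lower_crit = chi2.ppf(alpha / 2.0, nu)         # chi-squared critical value at 2.5%
    return sdep * np.sqrt(nu / upper_crit), sdep * np.sqrt(nu / lower_crit)

print(sdep_confidence_interval(1.0, 20))   # roughly (0.77, 1.44): an uncertainty of about 35%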
Single test set validation cannot provide an estimate of the SDEP uncertainty. Instead, CV validation and Monte Carlo validation can provide this information; e.g., in the case of CV validation each cancellation group contributes differently to SDEP. For each cancellation group g an estimate of SDEP can be obtained as:

SDEP^g_{CV} = \sqrt{ \sum_{i=1}^{N_g} (y_i - \hat{y}_i)^2 / N_g }

From the mean of these estimates, SDEP^{mean}_{CV}, and from its standard deviation, an estimate of R^{Samples}_{SDEP} can be obtained, provided that the number of objects in the cancellation groups is not very small. In the case of data set Soy-1 and response Moisture, with five cancellation groups (12 objects in each group), we obtained a standard deviation of SDEP^{mean}_{CV} of about 0.1, in good agreement with the estimate of R^{Samples}_{SDEP} from Eq. (15).
Unfortunately, commercial software, where validation is performed with the single test set or CV, never supplies this
information.
A second source of uncertainty (generally less important) is that due to the validation procedure and, in the case of CV validation, also to the order of the objects. We will indicate this uncertainty as R^{Validation}_{SDEP}, the range of SDEP values obtained with different validation procedures.
Table 2 shows, in the case of data set Soy-1 and response Moisture, the results obtained with two pre-treatments of the data and different numbers of CV groups. R^{Validation}_{SDEP} is about 0.15. Table 3 shows that the order of the objects also has an important influence on the value of SDEP, with R^{Validation}_{SDEP} of about 0.11.
In spite of this uncertainty, the effect of the pre-treatments, of the regression technique and of the selection of predictors can be evaluated.
The two pre-treatments in Table 2 can be compared row by row (because in each row the prediction is performed on the same objects) by means of statistical tests for matched pairs, the Student t-test on the differences or the Wilcoxon matched-pairs test. Both conclude that in this case column centering is more efficient than column autoscaling.

Table 2. Data set Soy-1, response moisture: difference test to compare centering and autoscaling.

Validation groups      SDEP centering   SDEP autoscaling   Difference
2                      1.104            1.143              0.039
3                      0.992            1.026              0.034
4                      0.958            0.954              -0.004
5                      1.097            1.139              0.042
7                      1.040            1.061              0.021
10                     1.060            1.093              0.033
20                     1.023            1.036              0.013
30                     1.047            1.059              0.012
60 (leave-one-out)     1.034            1.044              0.010

Mean                   1.039            1.062              0.022
SD                     0.046            0.058              0.016
SD mean                0.015            0.020              0.0052
R^Validation_SDEP      0.146            0.189
Student t                                                  4.3
Wilcoxon W                                                 1

All the parameters used to evaluate the predictive ability have a corresponding fitting parameter, such as the standard deviation of the error of calibration, SDEC, or the percent of explained variance, R^2, obtained with the values of the response computed by the model on the objects of the training set. These parameters are not very useful. However, when the differences between SDEC and SDEP, or between R^2 and Q^2, are large, the model is surely too complex, with overfitting.
Sometimes a standard error of performance, indicated with SEP and considered a measure of precision, is used. To obtain SEP, the Bias is computed as:

Bias = \sum_{i=1}^{N} (y_i - \hat{y}_i) / N

Then SEP is obtained by means of the equation:

SEP = \sqrt{ \sum_{i=1}^{N} (y_i - \hat{y}_i - Bias)^2 / (N - 1) }
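For completeness, a small sketch (ours) computing Bias, SEP and SDEP from a vector of predictions; note that SDEP^2 = (N - 1)/N SEP^2 + Bias^2, so SDEP combines precision and bias, while SEP alone can hide a systematic error.

import numpy as np

def bias_sep_sdep(y, y_hat):
    """Bias, SEP and SDEP for a set of predictions (e.g., from cross-validation)."""
    e = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    n = e.size
    bias = e.mean()                                    # Bias = sum(y_i - y_hat_i) / N
    sep = np.sqrt(np.sum((e - bias) ** 2) / (n - 1))   # SEP
    sdep = np.sqrt(np.sum(e ** 2) / n)                 # SDEP
    return bias, sep, sdep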
Table 3. Data set Soy-1, response moisture, five CV groups: SDEP obtained with 10 randomizations of the object order.

Randomization   SDEP
1               1.0967
2               0.9978
3               1.0433
4               1.0591
5               1.0272
6               1.0795
7               0.9871
8               1.0465
9               1.0463
10              1.1012

Mean                1.0485
SD                  0.0380
SD mean             0.0120
R^Validation_SDEP   0.1141


The parameter Bias is evaluated by means of a fitting procedure, without information about its statistical significance and without validation, so that its validity is at least questionable.
7.3. Predictive optimization
Many procedures (pretreatments, regression techniques) are
characterized by one or more parameters whose value is optimized, to have the minimum SDEP (predictive optimization).
This is the case of the number of latent variables in PLS, of
the ridge parameter in RR, of the number of epochs in ANN.
Also many selection techniques search for the combination of
predictors that minimizes SDEP.
A third set is frequently used to evaluate the true predictive performance of the optimized regression model, so that the
number of necessary objects increases.
In the case of PLS it does not seem necessary to have this third set. In other cases, as with ANN, especially when the number of objects is not very large, the final model has been obtained by forcing the parameters to predict well the response of the objects used in the predictive optimization, perhaps too well. So, the third set is necessary.
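A minimal sketch of predictive optimization of the number of PLS latent variables, with scikit-learn used only as an example implementation:

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

def best_n_latent(X, y, max_latent=15, n_groups=5):
    """Choose the number of PLS latent variables that minimizes SDEP in cross-validation."""
    y = np.asarray(y, dtype=float)
    cv = KFold(n_splits=n_groups)
    sdep = []
    for a in range(1, max_latent + 1):
        y_hat = cross_val_predict(PLSRegression(n_components=a), X, y, cv=cv).ravel()
        sdep.append(np.sqrt(np.mean((y - y_hat) ** 2)))
    return int(np.argmin(sdep)) + 1, sdep

Because the minimum SDEP is itself selected on these data, the value reported at the minimum is optimistically biased, which is exactly the reason for the third set discussed above.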
7.4. Sufficient number of well selected predictors

Generally, chemists use potentially useful predictors for multivariate calibration. In the case of spectra, regions with large noise, or where both the response and the interferents do not absorb, are discarded. However, a more or less large number of useless predictors is retained. Noisy predictors can have a very important negative effect on the prediction ability.
The example below was obtained with synthetic data. The response of the 100 objects was a function of two predictors, with a small noise added. The predictors have range 0-1. Ninety-eight random predictors, with range 1, 0.5, 0.25 or 0.10, were added to the two useful predictors. The results obtained with PLS (Table 4) show that:

(a) the influence of the noisy predictors is very large;
(b) this influence decreases with the decreasing range of the noisy predictors;
(c) autoscaling (which eliminates the differences of range between predictors) can enhance the effect of the noisy predictors.

Table 4. Effect of noisy predictors, synthetic data.

Predictors                                                        PLS-SDEP (centering)   PLS-SDEP (autoscaling)
Only the two useful predictors                                    0.0013                 0.0013
Also 98 noisy predictors with the same range as the useful ones   0.4709                 0.4983
Also 98 noisy predictors with range 1/2                           0.1474                 0.4983
Also 98 noisy predictors with range 1/4                           0.0387                 0.4983
Also 98 noisy predictors with range 1/10                          0.0061                 0.4983

The example is very pessimistic, because there are many noisy predictors and because in spectral data the noisy predictors very frequently have very small variance (e.g., when a spectral region without absorbance is considered). After autoscaling all the variances become 1, so that the effect of the noisy predictors increases. However, the example suggests caution in the selection of the spectral range and indicates that, when the results with autoscaling are clearly worse than those with only centering, probably too many noisy predictors are present.
Consequently, the regression model can very often be improved by elimination of noisy predictors. Commercial software offers a limited choice of methods for the elimination of noisy predictors. Depending on the problem, rather conservative elimination techniques (such as Martens' uncertainty test (MUT) or uninformative variable elimination (UVE)) or less conservative ones (stepwise OLS, stepwise orthogonalization) can be used.
A second effect of noisy predictors is the opposite, a consequence of chance correlations between predictors and response. This effect can be very important in the case of a small number of objects and a large number of predictors. Fig. 8 shows the 95% confidence value of SDEP/s_b as a function of the number of objects and of noisy predictors, as obtained by PLS working with all the predictors. Both the first and the second effect act, so that the global effect is rather complex.
With 20 objects and 20 predictors an SDEP/s_b value of 0.82, corresponding to 33% Q^2, is obtained only because of chance correlations. This does not seem dramatic.
However, when, to improve the regression model, we use a technique of selection of informative variables, the first effect is eliminated and the second effect appears very important, as shown in Fig. 9, where the predictors used by PLS were selected by means of stepwise OLS. Here, with 20 objects and 20 predictors, an SDEP/s_b value of 0.63, corresponding to 60% Q^2, can be obtained only by chance correlations.

Fig. 8. Effect of noisy predictors, all the predictors used in PLS. Lines of equal value of SDEP/s_b are shown.
Fig. 9. Effect of noisy predictors, predictors for PLS selected by means of stepwise OLS. Lines of equal value of SDEP/s_b are shown.

All the techniques of elimination of useless predictors work to decrease SDEP. A family of selection techniques (such as GA-PLS, I-PLS or ISE) searches for a combination of predictors with the
minimum SDEP. Others (MUT, IPW) work on the coefficients of the PLS regression model with the best predictive ability. For this reason, again, it is necessary to use a relatively large number of objects to develop the regression model, and to perform a preliminary selection of the predictors based on chemical experience. In the case of spectral data, the regions where both the response and the possible interferents do not absorb must be eliminated.
A favorable characteristic is the very large correlation among useful predictors. Fig. 10 shows the correlation structure typical of NIR data. The correlation is very large, also between the absorbances at far wavelengths. The minimum correlation is that between the first and the last variables, with a correlation coefficient of about 0.5 (95% confidence interval from 0.3 to 0.6), so significantly different from 0. Noisy variables are not correlated: in practice they have small correlation coefficients, not significantly different from 0. So, in the case of spectral data, the number of independent noisy variables is not large: it can be obtained by PC analysis, as the number of components that explains a large fraction of the variance (e.g., 99%). This number is generally between 10 and 20. On the contrary, when multivariate calibration is applied in QSAR, where the number of molecules is frequently between 20 and 30, the predictors (molecular descriptors) are less correlated and the effect of noisy predictors can be very large. For this reason, many QSAR studies have a very limited validity.
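The chance-correlation risk is easy to demonstrate on purely random data; in the following sketch (scikit-learn used only for convenience) both the response and the predictors are pure noise, and the variable selection is deliberately performed outside the cross-validation, so any apparently positive Q^2 is spurious.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n_objects, n_predictors = 20, 20                       # the situation discussed in the text
y = rng.standard_normal(n_objects)                     # random response
X = rng.standard_normal((n_objects, n_predictors))     # random, uninformative predictors

# crude "selection of informative variables": keep the 3 predictors most correlated with y
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_predictors)])
X_sel = X[:, np.argsort(corr)[-3:]]

y_hat = cross_val_predict(LinearRegression(), X_sel, y, cv=5)
sdep = np.sqrt(np.mean((y - y_hat) ** 2))
print(100 * (1 - sdep ** 2 / np.var(y)))               # an apparently positive Q2 can appear from noise alone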

Fig. 10. Correlation structure of the Kalivas predictors (correlation coefficient between the abscissa predictor and the ordinate predictor).

7.5. Pretreatments and fusion

Many simple and complex pretreatments are possible. Some software for multivariate calibration tries many pretreatments and their combinations, and suggests the best solution, i.e., the solution with the minimum SDEP, generally without attention to other qualities of the model, such as the model complexity, or to the variability of SDEP with the validation procedure. Taking into account the other possibilities (selection of predictors, compression, regression technique), a single calibration problem can have thousands of different solutions. Too many.
Some pretreatments are very simple, with only one possibility. Among them, the pretreatments that work on the single object are preferable; e.g., SNV and MSC are very similar, but the latter requires information from all the objects.
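For reference, SNV is simple enough to state in a few lines; a minimal sketch (ours), with MSC shown for contrast because it needs a reference spectrum computed from all the objects.

import numpy as np

def snv(X):
    """Standard normal variate: each spectrum (row) is centered and scaled by its own SD."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True, ddof=1)

def msc(X, reference=None):
    """Multiplicative scatter correction against a reference spectrum (default: the mean spectrum)."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    out = np.empty_like(X)
    for i, row in enumerate(X):
        b, a = np.polyfit(ref, row, 1)     # row is modeled as a + b * ref
        out[i] = (row - a) / b
    return out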
Other treatments are rather complex. The goal of orthogonal signal correction (OSC) [9] is to remove from the space of the predictors one or more directions orthogonal to the response. There are at least five different OSC algorithms, and the user also has to decide how many directions (orthogonal to the response and to each other) must be removed. The number of directions is determined by means of predictive optimization, so that a third set of objects is necessary to evaluate the true predictive performance of the final model.
We suggest exploring the use of simple pretreatments, such as SNV followed by centering or autoscaling, and the use of derivatives. A reduced number of models will be obtained, easily comparable taking into account the variability of SDEP due to the validation procedure (R^{Validation}_{SDEP}) and the standard deviation of the reference technique, SD_Reference.
In the case the best SDEP is significantly larger than SD_Reference, the next step should be the use of the techniques of elimination of useless predictors. The study of the residuals can suggest the use of non-linear regression models, e.g., artificial neural networks.
In the case of data set Hay-2, the results shown in Tables 5 and 6 indicate that the models are satisfactory (SD_Reference was 0.24, with a 95% uncertainty of about 0.02). The result with the simple SNV pre-treatment is very good, and significantly better than the results obtained with the other models.
Fusion means to use predictors from different instruments (which rarely happens in multivariate calibration) or to concatenate the predictors from different pretreatments. Simple fusion increases the number of predictors very much, generally (in our experience) without improvement of the prediction. When, before or after fusion, a procedure of selection of relevant predictors is used, the regression model can sometimes be significantly better, as in the case of data set Soy-1, response Oil, with the results in Fig. 11. Here, the whole procedure was performed with five CV groups, but the starting and final results were checked for R^{Validation}_{SDEP} (about 0.04) and with the tests on the differences. However, the selection procedure (I-PLS) works in predictive optimization, so that the final result is probably too optimistic.

Table 5. Data set Hay-2, response moisture: results with different pretreatments, five CV groups.

Pretreatment              SDEP
Original                  0.2403
SNV                       0.2145
Vast                      0.2467
1st Der                   0.2491
2nd Der                   0.2622
SNV + 1st Der             0.2259
SNV + 2nd Der             0.2501
Original, autoscaling     0.2444
SNV, autoscaling          0.2139
1st Der, autoscaling      0.2469
2nd Der, autoscaling      0.2673
7.6. Residuals
The study of the residuals, i.e., of the errors of prediction for
all the objects, is very important to obtain information about:
(a) Y-outliers;
(b) Non-linearity;
(c) Heteroscedasticity, i.e., the dependence of the error of prediction on the leverage or on the value of the response.
In the case of the data set Hay-2, Table 7 shows in the fourth column that the error of prediction is relatively small in the moisture intervals with a very large number of samples, and relatively large for small and high values of the moisture, generally those with larger leverage. This is an important result, because in the case of this data set the real objective was to detect samples with moisture >12%. The model obtained with all the available samples shows too large errors just in the interesting interval of the moisture. Instead, the model obtained with a well-balanced training set, obtained by Kennard-Stone design on the first ten principal components, does not show a dependence of the error on the moisture. Fig. 12 shows that the linear model is satisfactory and that the residuals are regularly distributed, both for the 60 objects used in the calibration and for the others.

Table 6. Data set Hay-2, response moisture: results with different CV groups, centered data.

CV groups            Original   SNV      Difference
3                    0.2329     0.2135   0.0194
5                    0.2403     0.2145   0.0258
7                    0.2420     0.2078   0.0342
10                   0.2541     0.2106   0.0435
20                   0.2533     0.2119   0.0414
LOO                  0.2444     0.2050   0.0394
R^Validation_SDEP    0.0212     0.0095
Mean                 0.2445     0.2105   0.0340

Fig. 11. Scheme of fusion combined with selection of predictors with I-PLS.
7.7. Outliers
Outliers are classified as Y-outliers and X-outliers.
Y-outliers are anomalous in the value of the response, i.e., they have a very large value of the prediction error compared with SDEP. Y-outliers can be the consequence of deterioration of the sample between the measurement of the response by means of the reference technique and the measurement of the predictors, of a fortuitous very large error in the measurement of the response, or of a trivial error; so, generally, they are the consequence of bad laboratory practice.
Table 7. Data set Hay-2, response moisture: error of prediction (SDEP, five CV groups) as a function of the response, for two training sets.

Moisture (from-to)   Objects   All the objects   60 objects, Kennard-Stone selection
2-4                  1         0.0579            0.0345
4-6                  7         0.4964            0.3684
6-8                  29        0.2095            0.2668
8-10                 212       0.1879            0.2701
10-12                47        0.2401            0.3136
12-14                8         0.3004            0.2503
14-16                1         0.3281            0.2225

Fig. 12. Data set Hay-2: model with Kennard-Stone selection.
Fig. 13. Data set Soy-1, response moisture: influence plot.

The elimination of Y-outliers must be justified by the identification of their origin. Otherwise, the final statement of the multivariate calibration can be: "The prediction error is about the same as SD_Reference, but sometimes, for unknown reasons, the error can be very large". An analytical procedure with this quality is not acceptable.
X-outliers are anomalous in the predictors, or in the relation
between predictor and response.
They can be detected by means of:

(a) PCA of the predictors (original or transformed);
(b) the leverage;
(c) the scores on the latent variables of PLS;
(d) the residual variance, i.e., the fraction of the information in the predictors not used by the regression model.

High-leverage samples have a leverage larger than 3-4 times the mean leverage. High-leverage samples are not necessarily anomalous. On the contrary, when the prediction error is small they stabilize the model. In some cases the leverage criterion to detect and eliminate X-outliers can be very dangerous. In the case of data set Hay, when all the objects are used for the calibration, many objects have high leverage and generally small or large values of the response. Their elimination reduces the useful range of the response or increases the uncertainty for the prediction of small and high values of the response. Instead, when only the 60 objects from the Kennard-Stone design are used for calibration, there are no high-leverage objects.
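Leverage itself is easy to compute from the matrix actually used in the regression (augmented predictors or latent-variable scores); a minimal NumPy sketch of the 3-4 times mean-leverage rule:

import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X^T X)^-1 X^T for the matrix used in the regression."""
    X = np.asarray(X, dtype=float)
    H = X @ np.linalg.pinv(X.T @ X) @ X.T
    return np.diag(H)

def high_leverage(X, factor=3.0):
    """Flag objects whose leverage exceeds `factor` times the mean leverage."""
    h = leverages(X)
    return np.where(h > factor * h.mean())[0]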
Sometimes X-outliers can be detected by means of PC plots or by means of the dendrograms of clustering techniques. A good strategy to detect these outliers is based on the X-residual variance. Fig. 13 shows the so-called influence plot in the case of data set Soy-1. Object 47 is neatly separated from the others because of its large X-residual variance. However, its error on the response is not too large, so that its elimination is almost without effect.
The detection of X-outliers is especially important when the calibration model is applied in practice. The regression model is valid only for samples similar to those used in calibration, without important differences in the chemical and physical matrix. The samples with unknown value of the response must be carefully checked to verify their similarity with the calibration objects. PCA, the leverage and the X-residual variance can be used. The use of the T2 statistic (the multivariate generalization of the Student t-statistic), based on the Mahalanobis distance from the mean, is almost equivalent to the use of the leverage, and the evaluation of the significance probability is correct only under the hypothesis of normal distribution of all the predictors.
7.8. Clusters
PCA plots can also reveal the presence of clusters in the space of the predictors. Fig. 14 shows two clusters in the case of the Kalivas data set. They correspond rather well to the two clusters of moisture.
Many people suggest, in such a case of clustered predictors, splitting the data and building a separate regression model for each cluster. Actually, in the case of the Kalivas data set the model developed with the 41 objects with low moisture behaves significantly better than the total model. Instead, the model with the 59 objects with high moisture has more or less the same SDEP as the total model.
7.9. Updating and transfer of calibration
Instruments change over time, because of wear and repairs. Consequently, the regression model must be updated. In some cases several instruments are used for the determination of the same response, and the regression model is developed with a first (Master) instrument and then applied on the other instruments (Slaves) for routine analysis. It seems advantageous to find a procedure that avoids the development of a separate calibration model on each instrument.
Several procedures [10–15] have been suggested to solve this problem: transfer of spectra, i.e., the reproduction of the spectra of the master instrument from those of the slave; transfer of the calibration model, i.e., the transformation of the original regression model into a model suitable for the slave instrument; correction of the results (predicted response) of the slave instrument obtained with the model developed on the master instrument; development of a joint regression model for two or more instruments.

Fig. 14. Kalivas data PC plot of SNV autoscaled data (left) and plot of responses (right).

These procedures require the use of some transfer calibration samples, to be measured with the same instrument at different times, or with different instruments. In the case of updating, the transfer samples are not related to a specific response, and the main objective is the updating of the spectra. The transfer samples must be very stable, reproducible and representative, which means that their spectra must show variation of absorbance over the whole wavelength interval.
In the case of two or more instruments and of a specific response, slope-and-bias correction is a very simple procedure, to be used before trying more complex procedures. The instruments must have the same predictors. The regression model of the master is applied to the predictors of the slave instrument, to obtain the vector of the predicted responses y_server^(model of master). The linear regression of the measured response on the predicted responses gives the correction rule:

    y_server^corrected = a + b y_server^(model of master)

In the case of data sets Soy-1 and Soy-2, response moisture and original data, the values of y_server^(model of master) heavily underestimate the response (Fig. 15). After the correction with

    y_server^corrected = 14.74 + 0.9388 y_server^(model of master)

the prediction is as good as for the first instrument.

Fig. 15. Prediction of moisture in the master and in the server (bottom points) instruments.


The correction equation depends on the response and on the pretreatment. With the same data and SNV, no correction was required. In the case of the response oil, the correction is unsuccessful with the original data but satisfactory with SNV.
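A minimal sketch of the slope-and-bias correction just described, assuming the reference values of the transfer samples (y_ref) and the predictions obtained on the second instrument with the master model (y_pred_server) are already available; the names are illustrative only.

import numpy as np

def slope_bias_correction(y_ref, y_pred_server):
    """Fit y_ref = a + b * y_pred_server on the transfer samples."""
    b, a = np.polyfit(y_pred_server, y_ref, 1)   # slope, then intercept
    return a, b

def apply_correction(a, b, y_pred_server):
    """Correct routine predictions of the second instrument."""
    return a + b * np.asarray(y_pred_server)

As stated in the text, whether such a correction is needed, and whether it works, depends on the response and on the pretreatment applied to the spectra.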
8. Conclusions
Multivariate calibration can be used to:
(a) evaluate the performance of a new chemometrics tool or
strategy, i.e., combination of pretreatments, padding, compression, fusion, regression technique;
(b) solve a real analytical problem without too much work in
chemometrics;
(c) solve a real analytical problem with the care usually
employed by analytical chemists.
In the first case people generally use literature data and the analytical problem is not too important. The main objective is to publish a research paper demonstrating that the presented method or strategy behaves very well. Attention to the principles presented in this review can make the realization of this objective difficult.
In the second case it is possible to use a good instrument with its proprietary software, and to execute the instructions carefully. The software will provide the best regression model, to be used in analysis. Faith in the instrument and software will guarantee satisfaction.
In the third case you will think and work a bit more, and
sometimes you will not be completely satisfied. The following
are some important points to be considered in this case. Some
are obvious, but frequently forgotten.

(1) Select many samples, to be sure that the variability of the analytical system is well explored. Measure the predictors. Eliminate useless predictors on the basis of chemical knowledge, remembering that useless predictors are those correlated neither with the response nor with the possible interferents.
(2) Use a simple pretreatment of the predictors, e.g., SNV in the case of infrared data. By means of a PC plot, evaluate the presence of heavy X-outliers. In this case repeat the determination and/or investigate to identify the possible cause (origin of the sample, deterioration). Moreover, evaluate the presence of clusters in the space of the predictors.
(3) Select, by means of the Kennard–Stone or an equivalent design, a sufficient number of samples (at least 50 in the case of linear models, preferably one hundred in the case of non-linear models); a sketch of the Kennard–Stone selection is given after this list. Create a test set for the final evaluation in the case you will try to use ANN (or another non-linear technique) or genetic algorithm selection, and you have many objects with an almost uniform distribution.
(4) Measure the response with the reference analytical technique. Repeat the determination at least two times. This will greatly reduce the probability of finding Y-outliers. Moreover, multivariate calibration cannot give a prediction error significantly lower than the error of the reference technique, and the repetition of the determination of the response reduces this error.
(5) Repeat, for the selected samples, the measurement of the predictors. Remember always that the use of chemometrics does not eliminate the need for good laboratory practice: no useful results can be obtained with low-quality chemical and physical data.
(6) Begin data analysis, remembering that the contact with
the analytical problem and with the laboratory must be
retained.
(7) Use simple pretreatments, such as SNV followed by centering or autoscaling, and the use of derivatives in the case of NIR data.
(8) Use PLS as the first regression technique. Repeat cross-validation with different numbers of CV groups and with leave-one-out. Evaluate the variability of SDEP due to the validation.
(9) Study carefully the residuals, to identify possible outliers and heteroscedasticity. In the case of Y-outliers investigate their possible origin.
(10) Compare the different pretreatments, to identify the best pretreatment (significantly lower SDEP).
(11) Compare the best SDEP with the standard deviation of the reference technique. If the difference is not significant, the model is satisfactory.
(12) Otherwise, try to use techniques for the selection of predictors. UVE and MUT are rather conservative and do not overestimate the prediction performance, so that they do not require special attention. With other selection techniques, always evaluate the possibility of random correlation.
(13) Then, there are other possibilities, such as fusion and compression. The data analysis becomes heavy. Probably it is better to reconsider the entire problem, e.g., the selection of the predictors or the instrument source of the physical information.
(14) Finally, in the case the previous efforts have been unsatisfactory, try to use ANN.
A final suggestion: chemometrics is a chemical discipline, which means that Chemistry is much more important than metrics. An excess of chemometrics means too complex models, underevaluation of chance correlations, and difficulty in maintaining the model over time. A correct use of chemometrics and a pragmatic use of statistical tests (valid under some rarely verified hypotheses) can instead be very useful.
Abbreviations
Typical acronyms of chemometrics are largely used in the text. An acronym can have many meanings (e.g., we found 114 different definitions for PCA), so the acronyms used in the text are collected here.
ACE: Alternating conditional expectations (regression)
AFA: Abstract factor analysis
ANN: Artificial neural networks
CV: Cross validation
DWT: Discrete wavelet transform
EMSC: Extended MSC
EMSCL: Logarithmic EMSC
GA: Genetic algorithms
GOLPE: Generating optimal linear PLS estimations
IPLS or I-PLS: Interval PLS
IPW: Iterative predictor weighting
ISE: Iterative stepwise elimination
MAXCOR: Maximum correlation
MLF: Multi-layer feed-forward neural networks
MLR: Multiple linear regression (the same as OLS)
MSC: Multiplicative scatter correction
MUT: Martens uncertainty test
NIR: Near infra-red
NIRS: NIR spectroscopy
OLS: Ordinary least-squares (regression)
OSC: Orthogonal signal correction
PC: Principal component
PCA: Principal component analysis
PCR: Principal components regression
PLS: Partial least squares (regression)
PRESS: Predictive residual error sum of squares
Q2: Percent variance explained in prediction
RSDEP (Validation): Range of the uncertainty of SDEP due to the validation
RSDEP (Samples): Range of the uncertainty of SDEP due to the samples
R2: Percent of explained variance (in fitting)
RMSECV: Root-mean-square error of cross validation (the same as SDEP)
SDReference: Standard deviation of the error of the reference technique
RR: Ridge regression
SDEC: Standard deviation of the calibration error
SDEP: Standard deviation of the error of prediction
SEP: Standard error of prediction (the same as SDEP)
SEP: Standard error of performance
SNV: Standard normal variate
SOLS: Stepwise OLS
UV: Unity variance (scaling, the same as autoscaling)
UVE: Uninformative variable elimination
WPT: Wavelet packet transform

Acknowledgment
Study developed with funds PRIN 2006 (National Ministry
of University and Research, University of Genova).
Appendix A. Matrix algebra
For our purpose, a matrix is a rectangular array of real numbers. A matrix has:
a number of rows R; a number of columns C.
A matrix is written as A or A (underlined capital or boldface).
We can also indicate the number of rows and columns as:
A(m × n), e.g., A(2 × 30) for a matrix of 2 rows and 30 columns;
mAn, e.g., 2A30;
ARC, only when R symbolizes the number of rows and C the number of columns.
A matrix with ONE column is a column vector, or simply a VECTOR.
A matrix with a single row is a ROW VECTOR.
A (column) vector is written as x or xR.
A row vector is written as xT or xTC.
A matrix with one row and one column is a SCALAR, written as a lowercase Latin letter, as x.
Also the single element of a matrix is a scalar: the element in row r and in column c of the matrix B is written brc.
A SQUARE MATRIX has the same number of rows and columns. This common number is the ORDER of the matrix.
A square matrix A is SYMMETRIC when arc = acr. This is an example of a square symmetric matrix:

    A = [ 23  19  12   2 ]
        [ 19  15   0   8 ]
        [ 12   0   7   1 ]
        [  2   8   1  23 ]

Note that the symmetry is relative to the diagonal from the


element in the first column of the first row to that in the last
column of the last row: this is the LEADING DIAGONAL of
the square matrix, or simply the diagonal.
A symmetric matrix in which all the elements are zero except
for the diagonal values is a DIAGONAL MATRIX. This is an
example of diagonal matrix:

    D = [ 1  0  0 ]
        [ 0  2  0 ]
        [ 0  0  3 ]

The transpose of a matrix A, written AT, is the matrix formed by interchanging the rows and the columns of A, i.e., the matrix B for which bik = aki. For example,

    A = [ 2  3 ]      has the transpose      AT = [ 2  2  8 ]
        [ 2  5 ]                                  [ 3  5  1 ]
        [ 8  1 ]

The first row of the transposed matrix is the first column of


the original matrix, the second row of AT is the second column of A, the first column of AT is the first row of A, and so
on.
If A has R rows and C columns, AT has C rows and R columns.
The transpose of a (column) vector is a row vector; the transpose of a row vector is a column vector.
The addition of two matrices A and B is possible only when
the number of rows and the number of columns are the same
for the two matrices. The SUM matrix has the same numbers
of rows and columns as the original matrices. Each element of
the sum matrix is the sum of the corresponding elements of the
added matrices.
There are several ways to define the multiplication in matrix
algebra. The multiplication of a matrix by a scalar and the inner
multiplication are commonly used.
A.1. Scalar multiplication
The multiplication of a scalar k and a matrix A gives the
matrix P with the same number of rows and columns as A, and
each element of P is given by prc = karc .

 

    P = kA = 3 [ 2  4 ]  =  [ 6  12 ]
               [ 1  2 ]     [ 3   6 ]
A.2. Inner multiplication
It is the most important matrix operation.
The product:
CMP = AMK BKP
is possible only when the number of columns of the premultiplying matrix is the same as the number of rows of the
postmultiplying matrix (the two matrices are conformable for
multiplication, in the stated order).
If B can be premultiplied by A, it is not necessarily true that B can be postmultiplied by A. The second operation is possible only when the number of columns of B is the same as the number of rows of A.
When two matrices can be multiplied, the result of the
multiplication is a matrix C with the number of rows of the
premultiplying matrix and the number of columns of the postmultiplying matrix, and each element of C is given by:
    cmp = Σ(k=1..K) amk bkp

The arrangement in Fig. A.1 shows the matrix product as a number of products of two vectors, and makes it very easy to understand this important matrix operation.
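As an illustration of the inner product just described, the following minimal NumPy sketch (the example matrices are arbitrary) computes each element cmp as the sum over k of amk·bkp and checks it against the built-in matrix product.

import numpy as np

A = np.array([[1., 2.], [3., 4.], [5., 6.], [7., 8.]])   # A(4 x 2)
B = np.array([[1., 0., 2.], [3., 1., 1.]])               # B(2 x 3), conformable with A
C = np.zeros((A.shape[0], B.shape[1]))
for m in range(A.shape[0]):
    for p in range(B.shape[1]):
        # c_mp = sum_k a_mk * b_kp, i.e., row m of A times column p of B
        C[m, p] = sum(A[m, k] * B[k, p] for k in range(A.shape[1]))
assert np.allclose(C, A @ B)                             # same result as the matrix product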
A.3. Trace and determinant
The trace of first order or simply TRACE of a square matrix
A, tr A, is a scalar, the sum of the elements on the leading

diagonal:

    tr A = Σ(r=1..R) arr

where the sum is over the R rows/columns.

Fig. A.1. Product A4,2 B2,3.


The DETERMINANT of a (square) matrix is denoted by D = |A|. It is recursively defined.
The determinant of a scalar is the scalar itself.
The determinant of a square matrix A of order 2 is

    |A| = a11 a22 − a12 a21

The determinant of a matrix of order 3,

    A = [ a11  a12  a13 ]
        [ a21  a22  a23 ]
        [ a31  a32  a33 ]

is:

    |A| = a11 | a22 a23 | − a12 | a21 a23 | + a13 | a21 a22 |
              | a32 a33 |       | a31 a33 |       | a31 a32 |

The determinant of a diagonal matrix is the product of the elements on the diagonal:

    |D| = Π(r) drr

A.4. Identity matrix

It is a diagonal matrix with all the elements on the leading diagonal equal to 1. It is indicated with the symbol I.
A NULL VECTOR is a vector with all elements equal to 0; it is denoted by 0.
A UNITY vector has all elements equal to 1; it is denoted by 1.
A.5. Singular matrix

A square matrix is singular when its determinant is zero.
A generic matrix A is singular if a vector x ≠ 0 exists such that Ax = 0 or ATx = 0. Suppose that

    A = [ 1  2  3 ]
        [ 5  2  7 ]
        [ 4  3  7 ]
        [ 5  4  9 ]

We can see each row of the matrix as a point in the three-dimensional space of the columns. In the case of the matrix A, column 3 is such that ai3 = ai1 + ai2. This is the equation of a plane through the origin, so that the true dimensionality is 2. Such a matrix is singular.
Consider the matrix

    ATA = [  67  44  111 ]
          [  44  33   77 ]
          [ 111  77  188 ]

The determinant is 18425 + 12100 − 30525 = 0.
Really, the determinant measures the dispersion in the space
whose dimensionality is the number of columns of matrix A.
When the number of rows is less than that of columns, the true
dimensionality cannot exceed the number of rows.
The RANK of a matrix, rank(A), is the dimensionality of the
subspace.
It is the maximum order of a non-singular square minor. In the case of the matrix

    A = [ 1  2  3 ]
        [ 5  2  7 ]
        [ 4  3  7 ]
        [ 5  4  9 ]

the square minors of order 3 are 0; e.g.,

    B = [ 1  2  3 ]
        [ 5  2  7 ]
        [ 4  3  7 ]

has determinant 1·(2·7 − 7·3) − 2·(5·7 − 7·4) + 3·(5·3 − 2·4) = −7 − 14 + 21 = 0.
A.6. Matrix inversion

A non-singular square matrix A has a unique inverse A−1 such that:

    A−1 A = A A−1 = I

In ordinary algebra y is the inverse of x when y·x = 1, and the inversion is possible only when x ≠ 0. The matrix inverse is the analogue in matrix algebra, and a matrix can be inverted only when its determinant is ≠ 0.
The inverse of a square matrix of order 2 must verify:

    A−1 A = [ a11(−1)  a12(−1) ] [ a11  a12 ]  =  [ 1  0 ]
            [ a21(−1)  a22(−1) ] [ a21  a22 ]     [ 0  1 ]

where arc(−1) denotes the element (r, c) of A−1.

A.7. Moore–Penrose generalized matrix inverse

Given a matrix ARC, the generalized inverse or pseudoinverse is the unique matrix A+ with C rows and R columns which satisfies:

    A A+ A = A
    A+ A A+ = A+
    (A A+)T = A A+
    (A+ A)T = A+ A

If the inverse of (AT A) exists, then:

    A+CR = (AT A)−1CC ATCR

A.8. The inverse of the covariance matrix

A covariance matrix is obtained as the product of the centered data XIV as:

    CVV = XTVI XIV / (I − 1)        (I = number of objects)

In the bidimensional case:

    C = [ s1²   s12 ]
        [ s12   s2² ]

where the elements on the leading diagonal are the variances and the element on the second diagonal is the covariance. The correlation coefficient r is r = s12/(s1 s2), so that:

    C = [ s1²       r s1 s2 ]
        [ r s1 s2   s2²     ]

The determinant D is D = s1² s2² − r² s1² s2², so that the inverse is:

    C−1 = [  s2²/D        −r s1 s2/D ]
          [ −r s1 s2/D     s1²/D     ]

Note that when r = 0, D = s1² s2², so that:

    C−1 = [ 1/s1²   0     ]
          [ 0       1/s2² ]

Important examples of (inner) matrix products

A.9. Vector premultiplied by its transpose

With aT = [1 2 3 4 5] and bT = [2 4 6 8 10]:

    aT a = 55
    bT b = 220
    aT b = bT a = 110

Matrix premultiplied by its transpose, XT X, with X formed by the two columns a and b:

    X = [ 1   2 ]
        [ 2   4 ]
        [ 3   6 ]
        [ 4   8 ]
        [ 5  10 ]

    XT X = [  55  110 ]
           [ 110  220 ]

An important case of application of matrix inversion to redundant equation systems is shown in this example.
A.10. Usual univariate regression

The usual univariate regression is an important example of the application of matrix inversion. Data (I = 5 objects):

    yI,1 = [3.1  4.9  7.0  9.2  11.1]T        XI,2 = [ 1  1 ]
                                                     [ 1  2 ]
                                                     [ 1  3 ]
                                                     [ 1  4 ]
                                                     [ 1  5 ]

Scalar notation:
    y1 = a + b x1
    y2 = a + b x2
    ...

Matrix notation:
    yI,1 = XI,2 b2,1

XI,2 cannot be inverted, but:

    XT2,I yI,1 = XT2,I XI,2 b2,1
    (XT2,I XI,2)−1 XT2,I yI,1 = b2,1

with

    XT2,I XI,2 = [  5  15 ]        (XT2,I XI,2)−1 = [  1.1  −0.3 ]
                 [ 15  55 ]                         [ −0.3   0.1 ]

    (XT2,I XI,2)−1 XT2,I = [  0.8   0.5  0.2  −0.1  −0.4 ]
                           [ −0.2  −0.1  0.0   0.1   0.2 ]

so that

    b2,1 = (XT2,I XI,2)−1 XT2,I yI,1
         = [  0.8·3.1 + 0.5·4.9 + 0.2·7.0 − 0.1·9.2 − 0.4·11.1 ]   [ 2.48 + 2.45 + 1.40 − 0.92 − 4.44 ]   [ 0.97  (intercept) ]
           [ −0.2·3.1 − 0.1·4.9 + 0.0·7.0 + 0.1·9.2 + 0.2·11.1 ] = [ −0.62 − 0.49 + 0.00 + 0.92 + 2.22 ] = [ 2.03  (slope)     ]

In the generalized inverse the first row gives the weights of each point in the computation of the intercept, and the second row gives the weights for the slope.

More equations than unknowns, as in the following example. Usual notation:

    3x + 2y = 1
    2x + 4y = 3
    −5x + 8y = 8

When we have more equations than unknowns, we can split the system into sufficient subsystems:

    { 3x + 2y = 1        { 3x + 2y = 1        { 2x + 4y = 3
      2x + 4y = 3 }        −5x + 8y = 8 }       −5x + 8y = 8 }

Matrices B2,2:

    [ 3  2 ]    [  3  2 ]    [  2  4 ]
    [ 2  4 ]    [ −5  8 ]    [ −5  8 ]

Determinants: 8, 34, 36.

Vectors x2,1 (solutions of the sufficient systems):

    [ −0.250 ]    [ −0.235 ]    [ −0.222 ]
    [  0.875 ]    [  0.853 ]    [  0.861 ]

A.11. Matrix solution of the global system x2,1

The global solution is the mean of the partial solutions weighted by the corresponding determinants:

    x2,1 = [ −0.229 ]
           [  0.858 ]

Matrix notation:

    B3,2 x2,1 = u3,1

with

    B3,2 = [  3  2 ]        xT1,2 = [x  y]        uT1,3 = [1  3  8]
           [  2  4 ]
           [ −5  8 ]

Multiplying by the transpose of B:

    BT2,3 B3,2 x2,1 = BT2,3 u3,1

Then x2,1 = (BT2,3 B3,2)−1 BT2,3 u3,1, with

    BT2,3 B3,2 = [  38  −26 ]        (BT2,3 B3,2)−1 = [ 0.033  0.010 ]
                 [ −26   84 ]                         [ 0.010  0.015 ]

    (BT2,3 B3,2)−1 BT2,3 = [ 0.121  0.108   0.084 ]
                           [ 0.061  0.081  −0.069 ]

    x2,1 = [ −0.229 ]
           [  0.858 ]
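A minimal NumPy sketch reproducing the two worked examples above (the univariate regression and the redundant system); the numbers are those of the text, and np.linalg solves the normal equations that were inverted by hand.

import numpy as np

# Usual univariate regression: y = a + b x, written as y = X b with a column of ones
X = np.column_stack([np.ones(5), np.arange(1, 6)])        # X(I,2)
y = np.array([3.1, 4.9, 7.0, 9.2, 11.1])
b = np.linalg.inv(X.T @ X) @ X.T @ y                      # b = (X'X)^-1 X'y
print(b)                                                  # approx [0.97, 2.03] (intercept, slope)

# More equations than unknowns: B x = u in the least-squares sense
B = np.array([[3., 2.], [2., 4.], [-5., 8.]])
u = np.array([1., 3., 8.])
x = np.linalg.inv(B.T @ B) @ B.T @ u
print(x)                                                  # approx [-0.229, 0.858]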

A.12. Orthogonal matrices


A non-singular square matrix A can be inverted, and the
inverse is a matrix A1 that satisfies the equation:
A1 A = I
where I is the identity matrix.
An orthogonal matrix is a square non-singular matrix for
which:
H1 = HT
This means that:
HT H = I
If H is orthogonal also HT is orthogonal.

Fig. A.2. Direction angles of an orthogonal rotation.

This is an example of an orthogonal matrix of two rows:

    [ h1,1  h1,2 ] [ h1,1  h2,1 ]   [ 1  0 ]
    [ h2,1  h2,2 ] [ h1,2  h2,2 ] = [ 0  1 ]

which means:
(a) the sum of squares of each line (or column) is 1;
(b) the sum of the cross-products of two lines (or two columns) is 0;
i.e.:

    h11 h11 + h21 h21 = 1
    h12 h11 + h22 h21 = 0
    h11 h12 + h21 h22 = 0
    h12 h12 + h22 h22 = 1

Fig. A.2 shows the rotation angles of an orthogonal (rigid) rotation of the axes counter-clockwise. Consider the matrix of direction cosines:

    [ h1,1  h1,2 ]   [ cos α   cos β ]
    [ h2,1  h2,2 ] = [ cos γ   cos δ ]

This matrix is orthogonal. We have:

    cos β = −sin α        cos γ = sin α        cos δ = cos α

Therefore:

    [ h1,1  h1,2 ]   [ cos α   −sin α ]
    [ h2,1  h2,2 ] = [ sin α    cos α ]

The product:

    [ cos α  −sin α ] [  cos α  sin α ]   [ cos²α + sin²α    0             ]   [ 1  0 ]
    [ sin α   cos α ] [ −sin α  cos α ] = [ 0                sin²α + cos²α ] = [ 0  1 ]

i.e., the identity matrix.
The coordinates of a point P in the rotated system can be obtained by means of the matrix H: the product of the row vector [x  y] by the orthogonal matrix with the direction cosines,

    [x  y] [ cos α  −sin α ]
           [ sin α   cos α ]

gives the coordinates e1 and e2 on the rotated axes system, which has direction angles α, β, γ, δ (Fig. A.3):

    e1 = x cos α + y sin α
    e2 = −x sin α + y cos α

Fig. A.3. The coordinates in the rotated space.
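A small numerical check of the rotation described above (the angle α = 30° is an arbitrary choice); it verifies that the matrix of direction cosines is orthogonal and that it gives the rotated coordinates e1 and e2.

import numpy as np

alpha = np.radians(30.0)                              # an arbitrary rotation angle
H = np.array([[ np.cos(alpha), np.sin(alpha)],
              [-np.sin(alpha), np.cos(alpha)]])       # rows = direction cosines of the rotated axes
assert np.allclose(H @ H.T, np.eye(2))                # orthogonal: H^-1 = H^T

x, y = 2.0, 1.0                                       # coordinates of a point P in the original axes
e1 = x * np.cos(alpha) + y * np.sin(alpha)            # coordinates in the rotated system
e2 = -x * np.sin(alpha) + y * np.cos(alpha)
assert np.allclose(H @ np.array([x, y]), [e1, e2])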
Appendix B. Scaling
Frequently, the variables are of different nature (e.g., concentrations, absorbances, temperatures) or have different unit (e.g.,
molarity, ppm, %w/w).
Many chemometrics techniques are very sensitive to the
numerical size of the variables: e.g., when the same concentration is measured in mMolar unit instead of Molar unit, the
variance increases 1000000 times.
Variance measures variability; variability is information.
So, apparently the quantity of information depends on the
selected unit.
SCALING is used to eliminate this dependence on the unit
and on the nature of the variable.
Scaling can also eliminate some useless information from the
original data matrix. So, we call Scaling techniques a series
of simple data pretreatments.
These pretreatments, in spite of the mathematical simplicity,
are very important, a fundamental step in data analysis.
An important function of scaling is: to give the same starting
importance (as information content) to all the variables. Then, on
the basis of previous knowledge, the variables can be weighted
according to their recognized information content, relevant for
the specific problem, or, inversely, according to their uncertainty
(standard deviation).

B.1. Scaling procedures


There are a lot of scaling-pretreatment procedures; sometimes they can also be used in combination, cautiously.
Let XNV be the original data matrix and ZNV the transformed data matrix. Examples of the nomenclature:

xiv, ziv: scalars, elements of the data matrices
x̄v: column mean, mean of variable v
x̄i: row mean, mean of object i
x̿: general mean
minv(xiv): column minimum of variable v
maxv(xiv): column maximum of variable v
sv: standard deviation of variable v
si: standard deviation of object i
wv: weight of variable v
x̄: mean object, vector of the V column means x̄v

Column pre-treatments: each variable is considered separately. The result depends on the objects used to compute the
scaling parameters.
(1) Column centering

    ziv = xiv − x̄v

eliminates systematic location differences among the variables. Centered variables have the same mean (0).

(2) Column standardization

    ziv = xiv / sv

eliminates scale effects (dispersion differences) among the variables; it can increase differences in location. Standardized variables have variance 1.

(3) (Column) autoscaling, the general scaling technique for variables of different nature:

    ziv = (xiv − x̄v) / sv

Autoscaled variables have mean 0 and variance 1.

(4) Pareto scaling

    ziv = (xiv − x̄v) / √sv

(5) VAST (variable stability) scaling

    ziv = [(xiv − x̄v) / sv] (x̄v / sv)

(6) (Column) range scaling

    ziv = [xiv − minv(xiv)] / [maxv(xiv) − minv(xiv)]

This equation produces a variable with range from 0 to 1; it can be easily modified to have a range from −1 to 1, or another selected, asymmetrical or symmetrical, range.
Row (object) pre-treatments: each object is considered separately. The result depends on the variables used to compute the scaling parameters.

(7) Row centering

    ziv = xiv − x̄i

eliminates systematic location differences among the objects.

(8) Row autoscaling or standard normal variate (SNV); a small computational sketch of autoscaling and SNV is given at the end of this list:

    ziv = (xiv − x̄i) / si

(9) Row profiles (when multiplied by 100: row percentages)

    ziv = xiv / Σ(v=1..V) xiv = xiv / (V x̄i)

After the row profiles are computed, the dimensionality of the data is reduced, so that generally the row profiles are centered.
(10) Linear detrending, to be applied only when the index v of the predictor has a physical significance (wavelength in spectra, time):

    ziv = xiv − a − bv

with

    b = Σ(v=1..V) (xiv − x̄i)(v − v̄) / Σ(v=1..V) (v − v̄)²        and        a = x̄i − b v̄

Linear detrending can also be performed by computing the residuals from the line that connects the first and the last point. This procedure can be useful in the case of compression or smoothing by means of Fourier transforms.

(11) Quadratic detrending, to be applied only when the index of the predictor has a physical significance (wavelength in spectra, time):

    ziv = xiv − a − bv − cv²

where the coefficients a, b, c are those of the least-squares regression of xi on the index of the predictor v (equivalent to the wavelength in the case of spectra).
Mixed (column and row) pre-treatments. The result
depends both on the objects and on the variables used to
compute the scaling parameters.
(12) Global centering

    ziv = xiv − x̿

(13) Global standardization

    ziv = xiv / st

with

    st = sqrt[ Σ(i=1..N) Σ(v=1..V) (xiv − x̿)² / (NV − 1) ]

(14) Double centering
Row centering followed by column centering, or (equivalently) column centering followed by row centering. With wiv an intermediate data matrix element:

    wiv = xiv − x̄i
    ziv = wiv − w̄v

(15) Double profiles of correspondence analysis

    ziv = xiv / [ Σ(v=1..V) xiv · Σ(i=1..N) xiv ]

The variance–covariance matrix of these profiles is the correspondence analysis matrix.

(16) Multiplicative scatter correction (MSC), a treatment used for NIR data. MSC computes the mean of the objects (the mean spectrum in the case of spectral data), then the parameters (intercept a and slope b) of the linear regression of each object on the mean object. These parameters are used for the transform:

    ziv = (xiv − a) / b

with

    b = Σ(v=1..V) (xiv − x̄i)(x̄v − x̿) / Σ(v=1..V) (x̄v − x̿)²        and        a = x̄i − b x̿

(17) Extended MSC (EMSC), a treatment used for NIR data

    ziv = (xiv − a − cv − dv²) / b

where a, b, c, d are the coefficients of the multiple regression of each object on the mean object x̄, on the index of the predictor (the wavelength) and on the square of the index.

(18) EMSC logarithmic (EMSCL), a treatment used for NIR data

    ziv = (xiv − a − c log(v)) / b
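A minimal sketch of two of the pretreatments listed above, column autoscaling (3) and SNV (8), assuming the data matrix X has the objects in rows and the variables in columns.

import numpy as np

def autoscale(X):
    """Column autoscaling: each variable gets mean 0 and variance 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def snv(X):
    """Standard normal variate: each object (row) gets mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

Note that the column pretreatment depends on the objects used to compute the scaling parameters, while the row pretreatment does not, as remarked in the text.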

B.2. Smoothing and derivatives


These treatments can be applied only when the index of the
predictor has a physical significance (wavelength in spectra,
time), i.e., in data with special importance in multivariate calibration. Always, these treatments are applied to the rows of the
data matrix, so that the result for each object is not influenced
by the other objects.
The objective of smoothing is the reduction of noise. Smoothing can be performed by means of the Fourier transform, wavelet decomposition or a moving least-squares polynomial, as in the case of Savitzky–Golay smoothing, based on a parabola through an odd number of points (the smoothing window). The value from the least-squares parabola in the central point substitutes the original value.
The derivative of the parabola of Savitzky–Golay smoothing in the central point is used to obtain the vector of the first derivatives of the object. Better, the derivative is obtained from the third-order smoothing polynomial, not used in the simple smoothing because the result is the same as that obtained with the parabola. A relatively large window (at least seven points) must be used. Fig. B.1 shows that the second-order and the third-order smoothing polynomials give the same value in the central point, but the third-order polynomial follows better the slope of the original points.
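Savitzky–Golay smoothing and derivatives are available in common scientific libraries; a minimal sketch with SciPy (an 11-point window, and a third-order polynomial for the first derivative, as suggested above) applied to an arbitrary noisy test signal could be:

import numpy as np
from scipy.signal import savgol_filter

x = np.linspace(0, 4 * np.pi, 256)
signal = np.sin(x) + 0.05 * np.random.randn(x.size)        # a noisy test signal

smoothed = savgol_filter(signal, window_length=11, polyorder=2)              # smoothing parabola
first_deriv = savgol_filter(signal, window_length=11, polyorder=3, deriv=1)  # first derivative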
B.3. Padding
Padding is used as a preliminary step in the case of compression of the signal by means of fast Fourier transform or wavelets.
In these cases the number of variables must be an integer power
of 2.
Padding can be performed by adding zeroes (preferably after
detrending of the signal).
A more efficient padding procedure is based on interpolation
with cubic spline.

Fig. B.1. (A) Smoothing parabola; (B) smoothing third-order polynomial. Window with 11 points.

In the case of the Fourier transform a trigonometric polynomial fits the V values of the variables:

    Pi(v) = ai0 + Σ(k=1..V/2) [ aik cos(2πkv/V) + bik sin(2πkv/V) ]

so that for the V integer values of v it is Pi(v) = xiv, which means that we have a system of V equations, from which the V parameters of the polynomial, ai0, . . ., ai,V/2, bi1, . . ., bi,V/2−1, can be obtained (bi,V/2 is always 0). ai0 is the row mean x̄i.
From the coefficients of the V/2 harmonics the amplitude of each harmonic is computed as:

    Fik = sqrt(aik² + bik²)

Fig. B.2. Moving cubic spline through the first four original variables.

Fig. B.3. Discrete wavelet transform and Mallat pyramid.

We use a moving spline with a four-points window. Through


the four points three third-order polynomials are computed, with
some constraints. The second (B in Fig. B.2) is used for interpolation, to obtain the padded values. In the example B gives the
values of the padded variables from 5 to 7. The first third order
polynomial A is used only for the first padded variables (in the
example from 1 to 4). The last polynomial C is used only for
the last padded variables, when the four points are the last four
original variables.
B.4. Compression
Two main compression techniques, Fourier transform and
wavelets, are used in multivariate calibration. Both are row transforms, but in the case of wavelets also column information is
generally used.

A low-pass filter is generally applied, to eliminate the high


frequency components, frequently many components. The
remaining amplitudes constitute the compressed signal.
Wavelets compression has been used extensively in multivariate calibration.
Wavelets is the name of a technique developed to study, like the Fourier transform, the frequency components of a signal, with the main objective of compression. However, the Fourier transform is global in the time domain, which means that when the frequency components of the signal change with time the Fourier transform is not able to recognize the location of the different frequencies in time.
On the contrary, wavelet decomposition [16,17] provides local information in the time domain. Wavelets work with filters whose windows increase as the elements of a geometric series of common ratio 2, i.e., 2, 4, 8, 16, 32, . . .. So the frequency components of the signal are not explored as finely as in the Fourier transform.
Wavelets are based on the repeated use of low-pass and
high pass filters. There are many type of filters, the families
of wavelets.
The simplest filter is the Haar filter, here explained in connection with the so-called DWT, Discrete Wavelet transform, and
with the Mallat pyramid (or DWT tree) shown in Fig. B.3.
In the first step (level), the output (Approximations or Scaling) of the low-pass filter, the smoothing filter, is the mean of two consecutive values of the variables (signal) s, si and si+1, with i ODD (Fig. B.4):

    a1,(i+1)/2 = (1/2) si + (1/2) si+1

Fig. B.4. Approximations and details with the Haar filter.

Fig. B.5. DWT wavelet spectrum.

Let V be the number of values of the signal. The values obtained by the low-pass filter are V/2. For this reason V must be EVEN; better, it must be an integer power of 2 (in the case of an odd number of variables, truncation or expansion with zeroes or padding is required).
The output of the high-pass filter, the enhancement filter (Details or Wavelets), is the noise, i.e., the difference between the same consecutive values of the signal:

    d1,(i+1)/2 = (1/2) si − (1/2) si+1

From the approximations and details the signal can be obtained by the inverse transforms:

    si = a1,(i+1)/2 + d1,(i+1)/2
    si+1 = a1,(i+1)/2 − d1,(i+1)/2

In the example of Fig. B.4, with 16 data, the first level computes eight approximations and eight details. In the second step the filters are applied to the eight approximations computed in the first step:

    a2,(j+1)/2 = (1/2) a1,j + (1/2) a1,j+1
    d2,(j+1)/2 = (1/2) a1,j − (1/2) a1,j+1

where j is the index of the approximation in the first level, j = (i + 1)/2. So the four approximations and the four details of the second level are computed.
Note that, because of the previous equations:

    a2,(j+1)/2 = (1/2)[(1/2) si + (1/2) si+1] + (1/2)[(1/2) si+2 + (1/2) si+3] = (si + si+1 + si+2 + si+3)/4

which means that the filter operates on a window of four elements of the signal.
At the third level, from the four approximations of the second level, two approximations and two details are obtained. The first approximation is the mean of the first eight elements of the signal; the second, a3,2, is the mean of the other eight elements of the signal.
Finally the filters are applied to the two approximations of the third level, and one approximation and one detail are obtained. The approximation is the mean of all the elements of the signal.
Because each step divides by 2 the number of approximations, the number of elements in the signal must be an integer power of 2, 2^K. The maximum number of levels in the Mallat pyramid is consequently K.

B.5. The bases

A base is a set of V elements (approximations and details) from which it is possible to reconstruct perfectly the original signal.
In the case of DWT the usual base is constituted by the mean (final approximation value, or simply scaling) and by the details. It is usually called the wavelet spectrum (Fig. B.5).
From a4,1 and d4,1 it is possible to compute a3,1 and a3,2. From these two approximations and from the details d3,1 and d3,2 it is possible to obtain the four approximations of the second level, and so on.
Generally the wavelet spectrum contains the data average and log2(V) coefficient bands whose size is an increasing power of two (e.g., 2^0, 2^1, 2^2, . . .).
The wavelet spectrum is not the only possible base. The other possible bases, in the case of V = 16, are reported in Fig. B.6.
We will use here the Ian Kaplan example (Kaplan data set)
of a signal with 16 elements:
S = {32, 10, 20, 38, 37, 28, 38, 34, 18, 24, 18, 9, 23, 24,
28, 34}
The Mallat pyramid is shown in Table B.1.

Fig. B.6. The possible bases (black) for V = 16.

Table B.1
Mallat pyramid of the Kaplan data set

Signal:                  32  10  20  38  37  28  38  34  18  24  18  9  23  24  28  34
Level 1 approximations:  21    29    32.5    36    21    13.5    23.5    31
Level 1 details:         11    −9    4.5     2     −3    4.5     −0.5    −3
Level 2 approximations:  25        34.25        17.25        27.25
Level 2 details:         −4        −1.75        3.75         −3.75
Level 3 approximations:  29.625                 22.25
Level 3 details:         −4.625                 −5
Level 4 approximation:   25.937
Level 4 detail:          3.6875
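A minimal sketch (not the authors' software) of the Haar DWT with the (1/2, 1/2) filters described above; applied to the Kaplan signal it reproduces the Mallat pyramid of Table B.1.

def haar_dwt(signal):
    """Return the list of (approximations, details) for each level of the Mallat pyramid."""
    s = list(signal)
    levels = []
    while len(s) > 1:
        a = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]  # low-pass: means of pairs
        d = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]  # high-pass: half differences
        levels.append((a, d))
        s = a                                                     # the next level works on the approximations
    return levels

S = [32, 10, 20, 38, 37, 28, 38, 34, 18, 24, 18, 9, 23, 24, 28, 34]
pyramid = haar_dwt(S)
# pyramid[3] -> ([25.9375], [3.6875]), the level-4 approximation and detail of Table B.1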

B.6. Wavelet packet transform

After the first level, the application of the filters also to the details produces an extended tree, the Wavelet Packet Transform (WPT), shown in Fig. B.7.
WPT produces information also on the frequency structure of the details, providing more complete information.
Moreover, WPT has a larger choice of possible bases, those of DWT and others. The number of possible bases is important because a base can be simplified by the elimination of very small approximations or details, so as to describe a signal, in the space of wavelet frequencies, with a reduced number of variables (compression).
B.7. Orthogonalization

The coefficients of the low-pass filter of the Haar wavelets are:

    h1 = 0.5        h2 = 0.5

and those of the high-pass filter are:

    g1 = 0.5        g2 = −0.5

The coefficients of the filters can be normalized, so that the sum of their squares becomes 1. The normalized coefficients are:

    h1 = 1/√2        h2 = 1/√2        g1 = 1/√2        g2 = −1/√2

So:

    Σ hj² = 1        Σ gj² = 1        Σ hj gj = 0

So, the normalized coefficients are orthogonal, and they have the significance of loadings, as those of principal components. Approximations and details can be considered as similar to the scores. The first level corresponds to the first two components, the second to the next two components, and so on.
The normalization introduces some changes with respect to what we obtained with the non-normalized coefficients. For a term t (approximation or detail) in level j:

    t(normalised)ji = 2^(j/2) t(non-normalised)ji

The consequences of normalization are:
(a) the power (sum of squares of the scaling and wavelet coefficients for a given level) is constant;
(b) the reproduction of the signal requires, for each element, the division by 2^(j/2) (with j the index of the level).

B.8. Daubechies wavelets


The Haar wavelets have obvious limitations; e.g., a discontinuity of the signal between elements 4 and 5 can be detected only at level 3.
There are many families of wavelet filters. The best known are the Daubechies wavelets. They work on an even number of points and here they are indicated as D2, D4, D6, . . .; D2 is the same as the normalized Haar filter.

Fig. B.7. The wavelet packet transform.

The scaling coefficients are usually indicated with the letter h. In the case of D4 they are:

    h1 = (1 + √3)/(4√2)
    h2 = (3 + √3)/(4√2)
    h3 = (3 − √3)/(4√2)
    h4 = (1 − √3)/(4√2)

The wavelet coefficients, usually indicated with the letter g, are:

    g1 = h4        g2 = −h3        g3 = h2        g4 = −h1

Each time a twin filter (low-pass, h, and high-pass, g) is applied to four consecutive elements of the signal s, from si to si+3, with i odd, it produces one approximation and one detail.
With V (an integer power of 2) elements of signal, when the filter is applied to sV−1, the second element is sV, so that two other elements are necessary. They are s1 and s2. In other words the signal is considered as periodic (for this reason wavelets are generally applied after detrending).

B.9. The best base

One of the objectives of the wavelet user is compression. The best base is selected taking into account the objective of compression.
There are many criteria to select the compressed best base, among them that based on a cut-off value and that based on the Shannon entropy. The result is a base with many cancelled elements.
In the case of many signals (objects) it is necessary to have a common best base, i.e., the same new variables for all the objects.
The choice of a common base can be made with:
(1) the signal variance spectrum; it is a vector of V elements, each element being the variance of a variable, computed on the N samples; the wavelet transform and the best-base selection are performed on the signal variance spectrum. Then the wavelet decomposition is performed on all the objects, and for each object the elements retained are those in the best base of the signal variance spectrum;
(2) the spectrum of the absolute correlation coefficients of each variable with the response; then the selection for each object is made as in point 1;
(3) the variance spectrum of the wavelet coefficients [18]. The wavelet transform is performed on all the objects. In the case of DWT a matrix of N rows and the V wavelet coefficients of the wavelet spectrum (those in Fig. B.5) is obtained. A variance spectrum of the coefficients is computed. The elements are sorted in decreasing order and the first elements (corresponding to a predefined percentage of the total variance) are retained. Then the selection for each object is made as in point 1. In the WPT case, for each element of the tree (maximum number of levels K times the number of variables V) the variance of the elements is computed. The best base is selected on this variance tree, and then applied to all the objects.

Appendix C. Principal components analysis

Principal component analysis (PCA) is a technique of orthogonal rotation, generally around the centroid of the data and after
autoscaling of the variables.
In the orthogonal rotation the variances on the axes, the
covariance and the correlation coefficients between the axes,
change.
The trace and the determinant of the variance covariance
matrix are constant in the rotation: the total dispersion (trace)
and the dispersion in the multidimensional space (determinant)
are invariant in rotation.
Fig. C.1 shows an example of two centered variables, with a positive correlation coefficient.
After rotation with a suitable angle, in the rotated space
(Fig. C.2) the new variables (the principal components) are:
(a) not correlated;
(b) ordered according to their variance (eigenvalue). The first
component is the direction of maximum variance.
The variance–covariance matrix of the original variables is:

    [ 292.959  152.692 ]
    [ 152.692  117.635 ]

Fig. C.1. Fifty objects in the space of two variables.

Fig. C.2. Objects of Fig. C.1 in the space of the two principal components.

Fig. C.4. Objects of Fig. C.1 (but with variable 2 modified) in the space of the
two principal components.

with
Trace: 410.594
Determinant: 11147.56

After rotation, clockwise by 30°, the variance–covariance matrix has been diagonalized:

    [ 381.364    0     ]
    [   0      29.231  ]

Moreover, in the rotated space the directions of the original variables can be shown, to maintain the contact between the abstract mathematical variables (PCs) and the concrete measured original variables. The position of the objects in the PC plot can be interpreted in terms of the original variables: e.g., the point P shown in Fig. C.2 has a relatively large value of both variable 1 and variable 2.
The PC rotation is not unique. The result in Fig. C.3 is perfectly equivalent to that in Fig. C.2. Both the relative position of the objects and their relationship with the original variables are the same.
PCs are heavily influenced by the scaling procedure. Fig. C.4 shows the same objects of Figs. C.1–C.3 when a different unit (10 times larger) is used for variable 2. The variance of this variable is reduced by a factor of 100.
So, the variance–covariance matrix is:

    [ 292.959   15.2692 ]
    [ 15.2692   1.17635 ]

with
Trace: 294.136
Determinant: 111.4756

After a rotation, clockwise, of only 3°, the diagonalized matrix is:

    [ 293.756   0     ]
    [   0       0.379 ]

Always, the first component has a large direction cosine with the variable having the largest variance.
In the case the variables have a different nature, and frequently when they have very different ranges, the autoscaling of the variables is necessary. After autoscaling all the variables have the same variance, 1.
In the case of the data in Fig. C.1, after autoscaling the variance–covariance matrix is the matrix of correlation coefficients:

    [ 1       0.8225 ]
    [ 0.8225  1      ]

with
Trace: 2
Determinant: 0.3235

After rotation (45°) the diagonalized matrix is:

    [ 1.8225   0      ]
    [ 0        0.1775 ]

Fig. C.3. Objects of Fig. C.1 in the space of the two principal components: alternative directions of the components.

The plot on the principal components is shown in Fig. C.5.
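The rotation described above can be verified numerically; a minimal NumPy sketch, using the variance–covariance matrix of the example in Fig. C.1, obtains the eigenvalues (the variances of the principal components) and the loadings.

import numpy as np

C = np.array([[292.959, 152.692],
              [152.692, 117.635]])          # variance-covariance matrix of the example
eigenvalues, loadings = np.linalg.eigh(C)   # eigh: for symmetric matrices, ascending order
print(eigenvalues[::-1])                    # approx [381.4, 29.2], as in the text
print(np.trace(C), np.linalg.det(C))        # trace and determinant, invariant under rotation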


Fig. C.6. The elimination of the information related to the direction of maximum
variability.

Fig. C.5. Objects of Fig. C.4 in the space of the two principal components, after
autoscaling.

The only problem is that variables with very small range frequently are noisy variables, e.g., in spectroscopy small range
variables correspond to wavelengths where there is no absorption. Usually, chemists do not use these variables. In the case
of spectra, where all the variables have the same measurement
unit, PC can be computed without autoscaling.
In the case of V > 2 variables, PC rotation computes V PCs,
all non correlated, orthogonal, and they are ordered according
their decreasing eigenvalue.
Also with many variables, very frequently only the first components have a significant eigenvalue. The last components, with
very small variance, are generally related to the noise.
This is the case of spectra, with a very large correlation
among the variables. In the case of spectra of mixtures of
C components there are only C important principal components, the others are surely noisy. The significant mathematical,
abstract, components correspond to the real chemical components.
There are many statistical techniques to detect the number of
significant components. The old Kramer rule indicates as significant the components with eigenvalue larger than the mean
variance (the trace divided by the number of variables). Too frequently this rule gives bad results. Double-cross validation [19]
is the best technique for usual chemical problems.
The starting information (variability and correlations) is represented by the covariance matrix with V variances and V
(V 1)/2 covariances: with one thousand variables there are
500500 elements. After PC rotation with 10 significant components the information is reduced to 10.
PCA can be used:
(a) to visualize a large amount of information (not necessarily
but frequently useful information) in the plots of the first
PCs;
(b) to compress data, from V original variables to C significant
components.
In the usual terminology of chemometrics the coordinates in the space of the PCs are called scores, and the orthogonal rotation matrix with the direction cosines is called the matrix of loadings, indicated with L. So, after PC rotation of the matrix XNV:

    SNV = XNV LTVV        (S = matrix of SCORES)

where the orthogonal rotation matrix LT is reported as the transpose of L, the orthogonal matrix used in the inverse rotation:

    XNV = SNV LVV         (L = matrix of LOADINGS)

From the first A components we can also compute a matrix:

    XAFA = SNA LAV

the data matrix reproduced with the first A components, possibly the significant components. The components are abstract mathematical factors, so the name of the reproduced matrix means abstract factor analysis (AFA) reproduced data matrix, i.e., the matrix without the noise associated with the components from A + 1 to V.
Finally it is possible to reconstruct the data matrix without the information carried by one or more components; e.g., we can eliminate the information from the first component:

    xiv(without) = Σ(a=2..V) sia lav

Fig. C.6 shows a physical analog. When the object is observed in


the direction of the first component (the maximum dispersion),
its information is canceled.
Appendix D. Elimination of useless predictors and
biased regression techniques
D.1. Selection techniques
Ordinary least-squares multivariate regression (OLS) computes the regression coefficients of the regression model
    y = b1x1 + · · · + bvxv + · · · + bVxV + bV+1 = xTM bM

where xTM is the row vector of the V predictors, augmented with 1 for the intercept, by means of:

    bM = (XTMN XNM)−1 XTMN yN

To obtain the vector of the regression coefficients it is necessary to invert the information matrix, which is impossible when N < M or when the predictors are heavily correlated.
Some strategies solve this problem by means of the selection of a subset of useful predictors. They are:
(a) all-subsets OLS;
(b) stepwise OLS;
(c) genetic algorithm selection;
(d) stepwise orthogonalization.

All-subsets OLS is based on the use of all the possible combinations of a suitable number H < V of predictors. Very rarely
used, it requires generally a very long computing time.
Stepwise OLS (SOLS) is a popular and very useful technique:
(1) In each step SOLS selects the predictor that increases more the explained variance. V univariate regression models y = a + bvxv are computed. The predictor with the minimum variance of the residuals,

    sv² = Σ(i=1..N) (yi − ŷi)² / (N − 1),

is selected. The residuals ri = yi − ŷi are computed and they will be used as the variable in the next selection step.
(2) SOLS uses ANOVA to evaluate the importance of the predictors. The selection stops when the significance test accepts the null hypothesis.
(3) SOLS uses a similar F test to verify whether a selected predictor can be removed.
Genetic algorithm (GA) selection [20] explores the possible combinations of a suitable number H < V of predictors. When the number of possible combinations is very large, GA generally finds a near-to-optimum solution in a reasonable time.
Stepwise orthogonalization [21] is rather similar to stepwise OLS. The first selected variable is the same. Let this variable be Z. It is decorrelated from the other variables by means of

    xiv(decorrelated) = xiv − r(xv,z) (sxv / sz) zi

where r(xv,z) is the correlation coefficient between the variable Xv and Z, and the s are the standard deviations. The second step works on the decorrelated variables, selects a second predictor Z, decorrelates it from the remaining unselected predictors, and so on.
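A minimal sketch of the decorrelation step used by stepwise orthogonalization, assuming x and z are one-dimensional arrays holding one candidate predictor and the selected predictor Z; it follows the formula given just above.

import numpy as np

def decorrelate(x, z):
    """Remove from x the part correlated with the selected predictor z."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    r = np.corrcoef(x, z)[0, 1]                            # correlation coefficient r(x, z)
    return x - r * (x.std(ddof=1) / z.std(ddof=1)) * z     # x_decorrelated = x - r (s_x/s_z) z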

Fig. D.1. A small data set visualized in the space of two predictors and of the
response.

Fig. D.1. A small data set visualized in the space of two predictors and of the response.

D.2. Biased regression techniques

The most important biased regression techniques are:
(a) ridge regression (RR);
(b) principal component regression (PCR);
(c) partial least squares regression (PLS).
RR adds to the elements on the diagonal of the information matrix XTMN XNM a small quantity (the ridge parameter). This addition makes the matrix invertible. A rather long computing time is required to select the optimum value of the ridge parameter, especially when V is large. Probably for this reason RR is rarely used in multivariate regression.
PCR first computes the principal components of the predictors. Then SOLS works on the components to select those more useful to obtain a good regression model. PCR is an excellent technique.
PLS is the technique most frequently used in multivariate calibration. It works with a series of simple regressions through the origin of the single predictors on the response. The variables are almost always centered, frequently autoscaled. Consider the data in Fig. D.1. In this example the data have not been centered.
PLS computes the slopes of the marginal regressions (X1 on Y and X2 on Y), proportional to the angles shown in Fig. D.2. The larger the covariance between a predictor and the response (for centered variables), the larger the slope.
The slopes are normalized, so that the sum of their squares is 1. So the normalized slopes, the PLS weights w, have the significance of direction cosines, which define a direction in the space of the predictors (Fig. D.3). This is the first latent variable or component of the PLS model. It is the direction, in the space of the predictors, with the maximum covariance with the response.

Fig. D.2. The partial least squares regressions of PLS.

Fig. D.3. PLS latent variable.
Fig. D.4. Projection on the PLS latent variable.

The objects are projected (ti = xiT w) on the first latent variable (Fig. D.4), and their scores, t1, t2, . . ., tN, are obtained. The regression through the origin of the response on the first latent variable gives the first PLS model, as a function of the scores ti (Fig. D.5) or, by means of the equation ti = xiT w, as a function of the predictors (the so-called closed form of PLS).
The information on the first latent variable is eliminated from both the predictors and the response (see Appendix C), and the second latent variable is computed.

Fig. D.5. PLS models the response as a function of the latent variable.
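A minimal sketch of the first PLS latent variable as just described, assuming centered X (objects in rows) and centered y: the weights are the normalized covariances of each predictor with the response, the scores are the projections, and the response is modelled by a regression through the origin on the scores. This illustrates the mechanism only; it is not a full PLS implementation.

import numpy as np

def first_pls_latent_variable(X, y):
    """One PLS component on centered X (N x V) and centered y (N)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = X.T @ y                      # proportional to the covariances of the predictors with y
    w = w / np.linalg.norm(w)        # normalized weights: direction cosines of the latent variable
    t = X @ w                        # scores: projections of the objects on the latent variable
    q = (t @ y) / (t @ t)            # regression through the origin of y on the scores
    # removing the information of this latent variable from X and y (deflation)
    # would give the next latent variable, as described in the text
    return w, t, q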

From the scores on the A latent variables used in the regression model, the leverage of each object can be obtained as:

    hiA = 1/N + Σ(a=1..A) tia² / Σ(j=1..N) tja²

The mean leverage is (A + 1)/N.
The elimination of the information related to the latent variables from both the predictors and the response decreases the variance of the predictors and of the response. A residual variance of the predictors, the X residual variance, can be computed for the objects after each latent variable.
Because of the partial regressions, PLS can work without problems also when the number of variables is larger than the number of samples, as is usual in multivariate calibration.
However:

(a) OLS has a unique solution. Instead, PLS computes as many models as the number of latent variables. So, the optimum number of latent variables must be determined. This is generally the one corresponding to the model with the lowest SDEP. Fig. D.6 shows that SDEP decreases neatly with the first latent variables, which means that these latent variables give more useful information than noise. Then SDEP is more or less constant. Finally, SDEP increases because the latent variables contain more noise than useful information.
The minimum SDEP, in the case of Fig. D.6, is obtained with 10 latent variables, but the difference with the result obtained with 8 or 9 latent variables is very small.
The number of latent variables corresponding to the minimum SDEP is uncertain, depending on the validation procedure and, in the case of CV validation, also on the order of the objects, as shown in the example of Fig. D.7.
The larger the number of latent variables (the complexity of the PLS model), the larger the regression coefficients in the closed form, which means that the errors on the predictors have a larger influence on the prediction. For this reason,

Fig. D.6. SDEP as a function of the number of latent variables (data set Hay-1,
response moisture, centered data, five CV groups).

Fig. D.7. Data set Soy, response moisture, centered data, five CV groups. Validation repeated with 10 different orders of the objects.

and because of the uncertainty on the real minimum, it is frequently preferable to reduce the complexity.
Many empirical procedures and statistical tests have been developed to find the optimum complexity. Among them are the Osten F statistics [22], the Osten threshold [22] and the van der Voet randomization test [23]. In the case of the example in Fig. D.6 all these tests suggest eight latent variables.
(b) all the variables contribute to the latent variables, also
noisy variables, obviously with a relatively small value of
the PLS weight. When the number of noisy variables is
large the regression model can have poor predictive performance. Many procedures have been suggested to improve
the performance of PLS by elimination of noisy variables
[24]:
Iterative stepwise elimination (ISE)
Iterative predictor weighting (IPW)
Uninformative variable elimination (UVE)
Generating optimal linear PLS estimations (GOLPE)
Martens uncertainty test (MUT)
Maximum correlation (MAXCOR)
Genetic algorithms PLS (GA-PLS)
Interval PLS (IPLS)
ISE [25,26] is based on the elimination of predictors with a small regression coefficient.
IPW [27] starts with the usual PLS regression. The importance, the product of the absolute value of the regression coefficient bv and the standard deviation sv (1 in the case of autoscaled data) of each predictor, weights each predictor in the next steps. So, a predictor with a small bv will have in the next IPW cycle a smaller covariance with the response and consequently a smaller bv, and finally, after two to ten cycles, it will be eliminated (bv = 0).
UVE [28] adds to the original predictors an equal number of random predictors, with very small values (range of about 10−10), so that their influence on the regression coefficients of the original predictors is negligible. The effect of the random

predictors is compared with that of the original predictors and the


original predictors that behave as the added artificial predictors
are canceled.
GOLPE [29] is a procedure used frequently in QSAR, very rarely in multivariate calibration. Also GOLPE adds a number of dummy predictors. Then GOLPE builds a large number of reduced models, similar to the complete model but with some variables removed. The effect of a predictor is measured by the predictive ability of the models in which it is present, and compared with the effect of the dummy predictors to decide about its elimination.
The Martens uncertainty test [30] is based on the standard deviation of the regression coefficients bv, computed from their values in the cycles of leave-one-out cross-validation. The predictors for which the hypothesis βv = 0 is accepted at the 5% significance level are eliminated (Student t-test).
MUT is the procedure used by Unscrambler, a data-analysis package widely used in the world of multivariate calibration.
GA-PLS [31] searches for the subset of predictors that produces the maximum prediction ability when the regression technique is PLS.
IPLS [32] divides the predictors into intervals, then it selects the best combination of intervals. When the number of intervals is very large IPLS approaches the all-subsets procedure (but the computing time becomes too large). When properly used [33], stepwise orthogonalization also selects suitable intervals of predictors.
Appendix E. Artificial neural networks
Artificial neural networks (ANNs) appeared in chemometrics
very soon, in the package ARTHUR developed in Seattle by
Bruce Kowalski. This first ANN was the very simple Linear
Learning Machine (generally known as Perceptron), based on a
try-and-correct procedure as all ANN.
Actually ANNs are very efficient, mainly for non-linear calibration problems.
Two main types of ANNs are used [34].
Multi-layer feed-forward neural networks (MLF) are currently used with one or more intermediate layers, the hidden
layers.
Fig. E.1 shows a typical arrangement of a three-layer net, with J + 1 neurons in the hidden layer. Each neuron contains a single value. The input layer receives the V predictors and adds the constant 1 (for the intercept). The J neurons in the hidden layer each receive a different linear combination Vj of the variables (J(V + 1) coefficients). The output neuron receives a linear combination of the Vj signals and transforms it by means of a sigmoid function (J + 1 coefficients). Because of the sigmoid function and of the J(V + 1) + J + 1 coefficients, the regression model is non-linear and very complex.
Fig. E.1. Three-layer MLF.

The net starts with random values of the coefficients. In an epoch the objects of the training set are randomly presented to the net, and for each object the value from the output neuron is compared with the value of the response. The error is used to correct the coefficients. After each epoch the fitting error is evaluated; it decreases continuously. A set of objects (the optimization set) is used to evaluate SDEP. The learning stops when SDEP reaches a minimum or becomes approximately constant (Fig. E.2). A third set is frequently used to evaluate the true predictive performance of the regression model.
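The following NumPy sketch illustrates the procedure just described (random start, epochs of randomly presented objects, correction of the coefficients, early stopping on the SDEP of an optimization set). It uses the common variant with sigmoid hidden neurons and a linear output, so the coefficient layout is not exactly that of Fig. E.1; net size, learning rate and stopping rule are arbitrary choices.

# Illustrative three-layer MLF sketch with early stopping on an optimization set
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlf(X, y, X_opt, y_opt, J=5, lr=0.01, max_epochs=1000, patience=50, seed=0):
    """X, X_opt: 2-D arrays; y, y_opt: 1-D arrays of the response."""
    rng = np.random.default_rng(seed)
    n, V = X.shape
    W1 = rng.normal(scale=0.1, size=(V + 1, J))   # input (+ constant 1) -> hidden
    W2 = rng.normal(scale=0.1, size=(J + 1,))     # hidden (+ constant 1) -> output

    def forward(A):
        A1 = np.hstack([A, np.ones((A.shape[0], 1))])   # add the constant 1
        H = sigmoid(A1 @ W1)
        H1 = np.hstack([H, np.ones((H.shape[0], 1))])
        return A1, H, H1, H1 @ W2                        # linear output neuron

    best_sdep, best_W, wait = np.inf, (W1.copy(), W2.copy()), 0
    for epoch in range(max_epochs):
        for i in rng.permutation(n):                     # one epoch
            A1, H, H1, yhat = forward(X[i:i + 1])
            err = yhat[0] - y[i]                         # compare output with the response
            grad_W2 = err * H1[0]
            grad_h = err * W2[:J] * H[0] * (1 - H[0])    # back-propagated error
            grad_W1 = A1[0][:, None] * grad_h[None, :]
            W2 -= lr * grad_W2                           # correct the coefficients
            W1 -= lr * grad_W1
        sdep = np.sqrt(np.mean((forward(X_opt)[3] - y_opt) ** 2))
        if sdep < best_sdep:
            best_sdep, best_W, wait = sdep, (W1.copy(), W2.copy()), 0
        else:
            wait += 1
            if wait > patience:                          # SDEP no longer decreases
                break
    return best_W, best_sdep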
Because of the optimization and test sets, ANNs require many well selected objects. Because of the number of possibilities (compression, number of hidden layers and neurons, learning rate), the selection of the best net and the learning procedure can be very long when the number of predictors is large, so that frequently they are applied to the principal components. The elimination of useless predictors is also a time-consuming task, performed by means of genetic algorithms [35]. ANNs do not provide much statistical information and must be considered as black-box tools. To our knowledge, they have not been compared with tools such as non-linear PLS, stepwise quadratic or cubic regression, or Alternating Conditional Expectations regression [36], methods that can give good results in the case of a non-linear relationship between predictors and response.
ANN software can be found easily, and ANN calibration
models, computed with a very large number of samples, are
sometimes sold with instruments.
Counter-propagation networks are constituted by a two-dimensional arrangement of neurons (Fig. E.3). The neurons in the first layer contain a vector of V data (corresponding to the V predictors); those in the second layer contain a single value, corresponding to the response.

Fig. E.2. MLF applied to Kalivas data, optimization set with 25 objects. Predictors: 15 PCs of autoscaled data.

The net starts with random values. In an epoch the objects of the training set are randomly presented to the net. Each time an object is presented to the net, the most similar neuron in the Kohonen layer is the winner. It is modified to become more similar to the presented object. The neurons closest to the winner are also modified, but to a lesser extent. The neuron corresponding to the winner in the Grossberg layer and its closest neurons are also modified, to become more similar to the response of the presented object. The extent of the modifications decreases with the epochs, and a final, almost steady state is reached. A neuron of the final Kohonen layer is the winner for a number of very similar objects, and the nearest neurons are the winners for rather similar objects, with close values of the corresponding response in the Grossberg layer, as shown in the example of Fig. E.4.
When an object not used to train the net is presented, the winner in the Kohonen layer is found, and the corresponding neuron in the Grossberg layer gives the predicted value of the response.
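A compact sketch of this training and prediction scheme is given below; the grid size, the learning-rate and neighbourhood schedules and the initialization are illustrative choices, not those of a specific package.

# Illustrative counter-propagation network sketch (Kohonen + Grossberg layers)
import numpy as np

def train_cpn(X, y, grid=(8, 8), epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    n, V = X.shape
    K = rng.normal(size=grid + (V,))                      # Kohonen layer: one V-vector per neuron
    G = rng.normal(np.mean(y), np.std(y) + 1e-12, grid)   # Grossberg layer: one response per neuron
    rows, cols = np.indices(grid)
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)                   # corrections shrink with the epochs
        radius = 1 + max(grid) / 2 * (1 - epoch / epochs)
        for i in rng.permutation(n):
            d = ((K - X[i]) ** 2).sum(axis=-1)            # similarity to every Kohonen neuron
            w = np.unravel_index(d.argmin(), grid)        # the winner
            dist = np.sqrt((rows - w[0]) ** 2 + (cols - w[1]) ** 2)
            h = lr * np.exp(-(dist / radius) ** 2)        # neighbours are corrected less
            K += h[..., None] * (X[i] - K)                # move neurons toward the object
            G += h * (y[i] - G)                           # and the responses toward y
    return K, G

def predict_cpn(K, G, x):
    # Find the winner in the Kohonen layer; the corresponding Grossberg value is the prediction
    w = np.unravel_index(((K - x) ** 2).sum(axis=-1).argmin(), K.shape[:2])
    return G[w]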

Fig. E.3. Counter-propagation networks.


Fig. E.4. Contour map of the response % protein in the Grossberg layer (data set Kalivas).

Appendix F. Data sets


F.1. Soy-1
Sixty samples of soy flour, NIR spectroscopy, 175 predictors.
The response variables are moisture, protein and oil. Details are
reported in the original paper [37].
F.2. Soy-2

The same samples as in Soy-1, but with the NIR spectra measured with a different instrument.
F.3. Hay-1
This data set [8] has been obtained from the Laboratori Agroalimentari de Cabrils (Cabrils Agri-food Laboratory), Dept. of
Agriculture, Livestock and Fisheries, Government of Catalonia (Spain). It has been used in a large interlaboratory study of
chemometric software and methods [8].
Three hundred five samples of forage, NIR spectroscopy,
174 predictors. The second derivative was obtained by means
of a moving smoothing third-order polynomial through 11
points.
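For reference, this kind of second-derivative pretreatment (a moving third-order polynomial through 11 points, i.e. a Savitzky–Golay filter) can be reproduced, for example, with SciPy; the spectra matrix below is only a random placeholder.

# Illustrative second-derivative pretreatment; the original work may have used different software
import numpy as np
from scipy.signal import savgol_filter

spectra = np.random.rand(305, 174)   # placeholder for the NIR data matrix
# Second derivative: 11-point window, third-order polynomial, along the wavelength axis
d2 = savgol_filter(spectra, window_length=11, polyorder=3, deriv=2, axis=1)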
F.4. Hay-2

This data set has been obtained from Hay-1, by inverting the values of the response Moisture for two adjacent samples, samples 114 and 115 in the original table.

F.5. Kalivas

NIR spectra of 100 flour samples [38]. Wavelength range from 1101 to 2502 nm with a 2 nm interval (701 predictors). The responses were moisture and protein. The second derivative of the spectra, obtained by means of a third-order smoothing polynomial through eleven points, was used.

References

[1] H. Martens, T. Naes, Multivariate Calibration, Wiley, Chichester, 1991.


[2] T. Naes, T. Isaksson, T. Fearn, T. Davies, Multivariate Calibration and
Classification, NIR Publications, Chichester, 2002.
[3] O.E. de Noord, Chemometr. Intell. Lab. 25 (1994) 85.
[4] R. Bro, Anal. Chim. Acta 500 (2003) 185.
[5] B.R. Kowalski, M.B. Seasholtz, J. Chemometr. 5 (1991) 129.
[6] M. Forina, S. Lanteri, C. Armanino, M.C. Cerrato Oliveros, C. Casolino, V-Parvus: an extendable package of programs for explorative data analysis, classification and regression analysis, Dip. Chimica e Tecnologie Farmaceutiche ed Alimentari, University of Genova, 2003. Freely available from the authors.
[7] R.W. Kennard, L.A. Stone, Technometrics 11 (1969) 137.
[8] I. Ruisanchez, F.X. Rius, S. Maspoch, J. Coello, T. Azzouz, R. Tauler, L. Sarabia, M.C. Ortiz, J.A. Fernandez, D. Massart, A. Puigdomènech, C. García, Chemometr. Intell. Lab. 63 (2002) 93.
[9] S. Wold, H. Antti, F. Lindgren, J. Ohman, Chemometr. Intell. Lab. 44 (1998)
175.
[10] J.S. Shenk, M.O. Westerhaus, W.C. Templeton Jr., Crop. Sci. 25 (1985)
159.
[11] Y. Wang, D.J. Veltkamp, B.R. Kowalski, Anal. Chem. 63 (1991)
2750.
[12] M. Forina, G. Drava, C. Armanino, R. Boggia, S. Lanteri, R. Leardi, P.
Corti, R. Giangiacomo, C. Galliena, R. Bigoni, I. Quartari, C. Serra, D.
Ferri, O. Leoni, L. Lazzeri, Chemometr. Intell. Lab. 27 (1995) 189.
[13] B.G. Osborne, T. Fearn, J. Food Technol. 18 (1983) 453.
[14] E. Bouveresse, C. Hartmann, D.L. Massart, I.R. Last, K.A. Prebble, Anal.
Chem. 68 (1996) 982.
[15] E. Bouveresse, Ph.D. Thesis, Vrije Universiteit Brussel, Brussels, 1997.
[16] B. Walczak, D.L. Massart, Chemometr. Intell. Lab. 38 (1997) 81.
[17] B. Walczak, D.L. Massart, Chemometr. Intell. Lab. 38 (1997) 39.
[18] B. Walczak, D.L. Massart, in: B. Walczak (Ed.), Wavelets in Chemistry,
Elsevier, Amsterdam, 2000, p. 165.
[19] S. Wold, Technometrics 20 (1978) 397.
[20] C.B. Lucasius, G. Kateman, Trends Anal. Chem. 10 (1991) 254.
[21] B.R. Kowalski, C.F. Bender, Pattern Recogn. 8 (1976) 1.
[22] D. Osten, J. Chemometr. 2 (1988) 39.
[23] H. van der Voet, Chemometr. Intell. Lab. 25 (1994) 313.
[24] M. Forina, S. Lanteri, M.C. Cerrato Oliveros, C. Pizarro Millan, Anal.
Bioanal. Chem. 380 (2004) 397.
[25] A. Garrido Frenich, D. Jouan-Rimbaud, D.L. Massart, S. Kuttatharmmakul, M. Martinez Galera, J.L. Martinez Vidal, Analyst 120 (1995) 2787.
[26] R. Boggia, M. Forina, P. Fossa, L. Mosti, Quant. Struct. Act. Relat. 16
(1997) 201.
[27] M. Forina, C. Casolino, C. Pizarro Millan, J. Chemometr. 13 (1999)
165.
[28] V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste,
C. Sterna, Anal. Chem. 68 (1996) 3851.
[29] G. Cruciani, S. Clementi, M. Pastor, in: H. Kubinyi, G. Folkers, Y.C. Martin (Eds.), 3D-QSAR in Drug Design. Recent Advances, Kluwer/Escom,
Dordrecht, 1998, p. 71.
[30] F. Westad, H. Martens, J. Near Infrared Spec. 8 (2000) 117.
[31] R. Leardi, R. Boggia, M. Terrile, J. Chemometr. 6 (1992) 267.
[32] L. Nørgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen, Appl. Spectrosc. 54 (2000) 413.
[33] M. Forina, S. Lanteri, M. Casale, M.C. Cerrato Oliveros, Chemometr. Intell. Lab. 87 (2007) 252–261.


[34] B.G.M. Vandeginste, D.L. Massart, L.M.C. Buydens, S. de Jong, P.J. Lewi,
J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics: Part
B, Elsevier, Amsterdam, 1998, p. 649.
[35] T. Hill, P. Lewicki, Statistics: Methods and Applications, StatSoft, Tulsa,
OK, 2006.


[36] L. Breiman, J.H. Friedman, J. Am. Stat. Assoc. 80 (1985) 580.


[37] M. Forina, G. Drava, C. Armanino, R. Boggia, S. Lanteri, R. Leardi, P.
Corti, R. Giangiacomo, C. Galliena, R. Bigoni, I. Quartari, C. Sella, D.
Ferri, O. Leoni, L. Lazzeri, Chemometr. Intell. Lab. 27 (1995) 189.
[38] J.H. Kalivas, Chemometr. Intell. Lab. 37 (1997) 255.
