Abstract
This paper critically reviews the problem of over-fitting in multivariate calibration and the conventional validation-based approach to avoid it.
It proposes a randomization test that enables one to assess the statistical significance of each component that enters the model. This alternative is
compared with cross-validation and independent test set validation for the calibration of a near-infrared spectral data set using partial least squares
(PLS) regression. The results indicate that the alternative approach is more objective, since, unlike the validation-based approach, it does not require
the use of soft decision rules. The alternative approach therefore appears to be a useful addition to the chemometrician's toolbox.
© 2007 Elsevier B.V. All rights reserved.
Keywords: Multivariate calibration; PLS; Component selection; Cross-validation; Test set validation; Randomization test; Near-infrared spectroscopy
in order to help inexperienced spectroscopists handle the additional computing power that was becoming available. I must
admit that the work of my co-column Editor in pushing for
Good Chemometrics Practice has hopefully raised awareness in the community of the potential pitfalls in using these
packages without due consideration, but I personally have not
been aware of clear, unambiguous automated warnings starting to appear when data was being over-fitted. (Our italics.)
Over-fitting causes harm because the model incorporates not only the predictive features of the data but also noise. The
implication is degraded model performance in the prediction stage.
An example of over-fitting that is conveniently visualized
occurs when a (two-dimensional) plane is fitted using two scores,
while a (one-dimensional) line, using a single score, would be
appropriate (Fig. 1). It is readily observed that prediction is
still reliable in a restricted region, namely sufficiently close
to the line. The concept of a correct predictive region while
effectively over-fitting is further illustrated for a univariate polynomial fit in Fig. 2. For interpolating points one observes
only a small, though statistically significant, increase of prediction uncertainty. By contrast, a large increase of prediction
uncertainty is clearly observed for extrapolating points. This is all the more striking because the two fitted relationships
are almost identical within the calibration range for this particular example.
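The effect is easy to reproduce. Below is a minimal NumPy sketch in the spirit of Fig. 2: polynomials of degree 2 and 3 fitted to the decennial US population figures (approximately the values shipped with Matlab's built-in census data set) agree closely when interpolating but diverge when extrapolating. The evaluation years 1985 and 2010 are our own illustrative choices.

```python
import numpy as np

# Decennial US population figures 1900-1990 in millions
# (approximately the values of Matlab's built-in "census" set).
years = np.arange(1900, 2000, 10, dtype=float)
pop = np.array([75.995, 91.972, 105.711, 123.203, 131.669,
                150.697, 179.323, 203.212, 226.505, 249.633])

x = years - years.mean()  # center the predictor for numerical stability
for deg in (2, 3):
    coef = np.polyfit(x, pop, deg)
    interp = np.polyval(coef, 1985.0 - years.mean())  # inside the data range
    extrap = np.polyval(coef, 2010.0 - years.mean())  # outside the data range
    print(f"degree {deg}: 1985 -> {interp:6.1f} M, 2010 -> {extrap:6.1f} M")

# Typical output: the interpolated values nearly coincide,
# whereas the extrapolated values differ markedly.
```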
It is generally good advice to avoid extrapolation when
deploying an empirical, entirely data-driven, soft calibration
model, since in a strict sense the estimated relationship is only
supported in a region close to the calibration points. However,
extrapolation is often implied (to some degree) by the goal of the
application. Apart from genuine prediction in time or forecasting
as in Fig. 2, important examples of unavoidable extrapolation
are:
- the detection of lower analyte concentrations in trace analysis;
- the determination of analyte content using the method of standard additions;
- the development of a product with higher consumer appreciation in sensory and marketing research; and
Fig. 1. Two collinear X-variables onto which the Y-data are regressed. Note that an extremely high collinearity is the rule for adjacent channels in molecular spectroscopy, e.g., NIR. The X-variables allow for the construction of only one stable component using the first score, t1. By contrast, the plane spanned by the first two scores, t1 and t2, is unstable. The spread of the fitted Y-data points around the first axis is caused by noise and should therefore be ignored. The model based on the first score is an effective noise filter, whereas the plane is over-fitting the data.
Predictive performance is conventionally assessed by applying the model to
(independent) validation samples and averaging the squared prediction errors, i.e., the differences between model prediction and
the associated known reference value. The square root of this
average squared error is known as the root mean squared error of
prediction (RMSEP). In equation form, for increasing number
of components (A),
$$\mathrm{RMSEP}(A) = \sqrt{\frac{1}{N_{\mathrm{val}}} \sum_{n=1}^{N_{\mathrm{val}}} \left(\hat{Y}_{A,n} - Y_{\mathrm{ref},n}\right)^{2}} \qquad (1)$$
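For readers who prefer code over notation, the following sketch evaluates Eq. (1) for an increasing number of PLS components. It uses scikit-learn's PLSRegression rather than the Matlab implementation of the paper, and the array names (X_cal, y_cal, X_val, y_val) are our own assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def rmsep_curve(X_cal, y_cal, X_val, y_val, max_components):
    """RMSEP(A) of Eq. (1) for A = 1, ..., max_components."""
    rmsep = []
    for a in range(1, max_components + 1):
        model = PLSRegression(n_components=a).fit(X_cal, y_cal)
        # Residuals: reference minus prediction (sign is irrelevant after squaring).
        errors = y_val - model.predict(X_val).ravel()
        rmsep.append(np.sqrt(np.mean(errors ** 2)))
    return np.array(rmsep)
```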
Fig. 2. The population of the USA in the period 1900–1990. This data set is available through the built-in function census of Matlab (The Mathworks, Natick, MA, USA). The data are fitted using polynomials of degree 2 and 3, respectively. The associated uncertainty bands are given by the same line type as the corresponding fit.
Fig. 3. Schematic presentations of bias and variance contributions to RMSEP as a function of model dimensionality, e.g., the number of PLS components: (left panel) standard presentation, where variance increases rapidly and bias gives a substantial contribution to RMSEP for the optimum model, and (right panel) alternative presentation, where variance increases slowly (when interpolating) and bias is relatively small for the optimum model. The latter asymmetric presentation is usually more realistic in practice and illustrates why under-fitting is seldom a concern.
$$\sigma(\mathrm{RMSEP}) \approx \frac{\mathrm{RMSEP}}{\sqrt{2\,N_{\mathrm{val}}}} \qquad (2)$$
where σ(·) denotes the standard error of the associated quantity. This expression is an example of the law of diminishing
returns. For example, to achieve a relative uncertainty of less
than 20% requires about 13 validation samples (that spread
out reasonably well in calibration space). To further reduce
this uncertainty to less than 10%, one has to quadruple the
number of validation samples. Eq. (2) thus enables the
analyst to calculate the number of validation samples (s)he
needs to report an RMSEP estimate to a sufficient number of (significant)
decimal digits, usually two.
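Assuming the standard-error expression of Eq. (2) as reconstructed above, the required number of validation samples follows by inverting it. A two-line sketch:

```python
import math

def n_val_required(rel_uncertainty):
    """Validation samples needed so that sigma(RMSEP)/RMSEP <= rel_uncertainty,
    by inverting Eq. (2): N_val ~ 1 / (2 * r**2)."""
    return math.ceil(1.0 / (2.0 * rel_uncertainty ** 2))

print(n_val_required(0.20))  # 13 samples for 20% relative uncertainty
print(n_val_required(0.10))  # 50 samples, i.e., roughly a quadrupling
```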
Often, the RMSEP estimates do not exhibit a clear global
minimum such as the one sketched in Fig. 3. This is a direct consequence of the
previous issues. As a result, one often has to resort to soft
decision rules like the first local minimum or the start of a
plateau, which is highly unsatisfactory from both a practical
and a scientific point of view. It is important to note that
the previous issues have led researchers to develop error indicator functions that do not require possibly noisy reference
values [13,14].
Specific problems with the conventional approach are:
- External validation, i.e., test set validation, is best in the sense that a closer assessment of RMSEP is possible (test is best). However, it is wasteful because the validation samples are not available for the construction of the model.
- Cross-validation, on the other hand, ensures a more economical use of the available data, but it has two major drawbacks. First, it cannot be used if the data are designed. This can be under-
Fig. 4. Generating the distribution under the null-hypothesis (H0) by building a series of PLS models after pairing up the observations for predictor (X) and response (Y) variables at random. Any result obtained by PLS modeling after randomization must be due to chance. Consequently, the statistical significance of the value obtained for the original data follows from a comparison with the corresponding randomization results.
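A compact sketch of the procedure of Fig. 4 follows. The test statistic used here, the absolute covariance between the a-th score vector and y, is an illustrative choice on our part; the paper's exact statistic may differ, and scikit-learn stands in for the original Matlab implementation. y is assumed to be a 1-D array.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def randomization_test(X, y, a, n_perm=1000, seed=0):
    """Null distribution for the a-th PLS component (cf. Fig. 4).

    Pairs the X-rows with randomly permuted y-values, refits the model and
    recomputes the statistic; any structure found then is due to chance.
    """
    rng = np.random.default_rng(seed)

    def statistic(y_current):
        scores = PLSRegression(n_components=a).fit(X, y_current).x_scores_
        return abs(np.cov(scores[:, a - 1], y_current)[0, 1])

    observed = statistic(y)
    noise = np.array([statistic(rng.permutation(y)) for _ in range(n_perm)])
    alpha = (noise >= observed).mean()  # fraction of chance results >= observed
    return observed, noise, alpha
```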
Fig. 6. Validation results for the example data set: (top panels) internal RMSECV for the 84 calibration samples and (bottom panels) external RMSEP for the 155 independent validation samples. To better exploit the vertical scale, the first point is omitted in panels (b) and (d).
...measurement error standard deviation σref = 0.025 g (100 g)^−1. The 239
samples were split into a calibration and validation set by using
the duplex algorithm. This method starts by selecting the two
points furthest from each other and puts them both in a first
set (calibration). Then the next two points furthest from each
other are put in a second set (validation), and the procedure is
continued by alternately placing pairs of points in the first or
second set. As a result, 84 samples were used for calibration
and 155 samples for validation. It is noted that the majority
of the available samples was selected for (external) validation,
which is unusual in practice. However, Fernandez Pierna et al.
had chosen this particular data split to test expressions for multivariate sample-specific prediction uncertainty [19]. In other
words: the focus was more on assessing the predictive ability of a
model than on obtaining the best model. For the current
study, too, it is useful to have a relatively large validation
set, because external validation is generally considered to be the
gold standard.
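The duplex selection just described translates into a short routine. Below is a simplified sketch of the duplex algorithm exactly as summarized above (alternating the most distant remaining pairs); note that it yields a roughly even split, whereas the 84/155 split of the example was chosen deliberately.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def duplex_split(X):
    """Duplex split as summarized in the text: repeatedly take the two
    remaining points that are farthest apart and assign the pair
    alternately to the calibration and validation sets."""
    D = squareform(pdist(X))           # pairwise Euclidean distances
    remaining = list(range(len(X)))
    cal, val = [], []
    sets = (cal, val)
    turn = 0
    while len(remaining) >= 2:
        sub = D[np.ix_(remaining, remaining)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        a, b = remaining[i], remaining[j]
        sets[turn].extend((a, b))
        remaining.remove(a)
        remaining.remove(b)
        turn = 1 - turn                # alternate between the two sets
    sets[turn].extend(remaining)       # a single leftover point, if any
    return np.array(cal), np.array(val)
```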
3.2. Calculations
The proposed randomization test has been implemented in
Matlab 7.0 (The Mathworks, Natick, MA, USA) and the program
is available from the first author. Histograms of noise values
were generated using 1000 permutations. Although as few as
100 permutations can be used [17], this relatively large number ensures that the resulting histograms are fairly smooth. For
the current example data set (84 samples × 2128 wavelengths),
the computations were completed within seven CPU seconds
on a 3.4 GHz personal computer. To calculate the risk of over-fitting when, in fact, none of the noise values exceeds the value
under test, the so-called inverse Gaussian function is fitted to the
noise values. This function is often suited for modeling positive
and/or positively skewed data [20].
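When none of the (say) 1000 noise values exceeds the value under test, the empirical tail probability is zero and a parametric fit is needed. The sketch below uses SciPy's invgauss in place of the Matlab routine credited to Chris Brown; the maximum-likelihood fit is our assumption about how the fitting is done.

```python
import numpy as np
from scipy import stats

def overfit_risk(noise_values, value_under_test):
    """Tail probability (risk of over-fitting) when the value under test
    exceeds all permutation noise values: fit an inverse Gaussian to the
    noise values and evaluate its survival function at the value under test."""
    mu, loc, scale = stats.invgauss.fit(np.asarray(noise_values))
    return stats.invgauss.sf(value_under_test, mu, loc=loc, scale=scale)
```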
Fig. 7. Randomization results for the example data set: histogram of 1000 noise values, fitted using the inverse Gaussian function. The symbol α stands for the significance level.
Fig. 8. Comparison of current practice of multivariate predictive modeling (left) and the one enabled by the proposed alternative (right).
The probability that the value under test is due to chance (α) is extremely small for components 1 (0.0009%), 2 (0.02%), 4 (0.0006%) and 5 (0.002%).
Interestingly, the significance of component 3 is only 3.3%. We
speculate that this is due to component 3 taking care, with some
difficulty, of subtle non-linearities in the spectra, after which
the remaining linear contributions are conveniently handled by
components 4 and 5. The high α-values for components 6-8
constitute a clear, unambiguous warning that over-fitting starts
after the fifth component.
5. Recommendations
The proposed randomization test enables a different scheme
for calibration modeling (Fig. 8). The essential difference is that
the two critical steps preceding the actual modeling process are
disentangled. The best data pre-treatment depends strongly on the
type and quality of the input data. Expert knowledge is valuable
in this stage, and allowing for subjectivity here is to be understood in a favorable sense: it may help reduce the inherent
black-box character of soft modeling procedures. By contrast,
the selection of the optimum model dimensionality should be kept
fully objective, since a human expert cannot judge whether the observed
modeling power of an additional component is genuine, i.e.,
not due to chance. An adequate validation step (of course one
must validate!) then constitutes the justification of the overall
trial-and-error procedure. It is stressed that not relying completely on validation for component selection has an added bonus:
the RMSEP estimate can be reported with more confidence, simply because it has not guided component selection.
Until now the discussion has focused completely on quantitative
aspects of multivariate calibration modeling. However, in application areas such as sensory research, qualitative aspects such
- The alternative enables one to scrutinize individual components without making strong assumptions about the data.
- It is user-friendly because it only requires (1) the number of permutations and (2) the critical significance level to be selected. The first requirement constitutes the only practical difference between a randomization test and a conventional statistical test.
- The result is often consistent with the one obtained using validation (e.g., Unscrambler or SIMCA advice), but now it is fully objective: visual inspection does not play a role.
- It can replace validation for component selection, but it can also supplement the common plot (RMSEP estimates vs. components) with advice.
- The applicability of the randomization test is not restricted to PLS regression. Moreover, it is easily verified that it also applies to multiway calibration. One only needs to replace the (1-way) rows of the X-matrix in Fig. 4 by the appropriate data object (2-way matrices or, in general, N-way arrays).
- Compression may be necessary to handle large data sets (see the sketch after this list). However, once the compression is done, other computer-intensive methods such as bootstrap, jack-knife and cross-validation can be entertained almost for free as well. The only requirement is that the compression should not introduce dependencies among the samples.
- The currently described randomization test operates on the calibration set. A purpose could be to add objectivity to cross-validation, cf. Fig. 6a and b. There is no reason, however, why it cannot be adapted to add objectivity to test set validation, cf. Fig. 6c and d.
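As an illustration of the compression idea mentioned in the list above, the SVD-based sketch below replaces a wide X by its sample-wise scores. Because every row undergoes the same fixed rotation, row-wise resampling of the scores is equivalent to row-wise resampling of X, so no dependencies among the samples are introduced. This construction is our own illustration, not necessarily the compression the text has in mind.

```python
import numpy as np

def compress_samples(X, tol=1e-10):
    """Lossless sample-wise compression of a wide X (n samples << p variables).

    T = U * S preserves the inter-sample geometry of X in at most n
    columns, so PLS on T gives the same predictions as PLS on X, while
    permutations, bootstrap or cross-validation now operate on a much
    smaller matrix.
    """
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    rank = int(np.sum(s > tol * s[0]))   # numerical rank
    return U[:, :rank] * s[:rank]
```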
In summary, the proposed randomization test appears to be a useful addition to the chemometrician's toolbox.
Acknowledgements
The thoughtful comments by Waltraud Kessler (Reutlingen University), Randy Pell (The Dow Chemical Company),
Michael Sjöström (Umeå University) and Svante Wold (Umeå University) are appreciated by the authors. We further thank
Chris Brown (InLight Solutions) for supplying the function for the inverse Gaussian fit and Alejandro Olivieri (Universidad
Nacional de Rosario) for pointing out a numerical problem in