
Letter to the Editors
www.molinf.com
DOI: 10.1002/minf.201400030
Mol. Inf. 2014, 33, 311-314

External Evaluation of QSAR Models, in Addition to Cross-Validation: Verification of Predictive Capability on Totally New Chemicals

Paola Gramatica*[a]

[a] P. Gramatica
QSAR Research Unit in Environmental Chemistry and Ecotoxicology, Department of Theoretical and Applied Sciences, University of Insubria, Via Dunant 3, 21100, Varese, Italy
*e-mail: paola.gramatica@uninsubria.it

Dear Editors,

an interesting paper by Gütlein et al., recently published in your journal,[1] has reopened the debate on the crucial topic of QSAR model validation, which, over the past decade, has been the subject of wide discussion in the scientific and regulatory communities. Many notable scientific papers have been published (I cite here only a few of the most pertinent[2-17]) with different underlying ideas on the best way to validate QSAR models using various methodological approaches: a) only by cross-validation (CV),[1,6-9] simple or double CV; b) by an additional external validation[2-5,10-17] (better if verified, in my opinion, by different statistical parameters),[15-18] after the necessary preliminary internal validation by CV. The common final aim is to propose good QSAR models that are not only statistically robust, but also have a verified high predictive capability. The discrepancy between these two approaches lies in this point: how to verify the predictive performance of a QSAR model when it is applied to completely new chemicals.

In the Introduction to their paper,[1] Gütlein et al. wrote: "Many (Q)SAR researchers consider validation with a single external test set as the gold standard to assess model performance and they question the reliability of cross-validation procedures". In my opinion, this point is not commented on clearly, at least in reference to my cited work,[10] so I wish to clarify my validation approach in order to highlight and resolve some misunderstandings. First of all, I am sure that all good QSAR modellers cannot disagree that CV (not simply by LOO, but also by LMO and/or bootstrap) is a necessary preliminary step in any QSAR validation, and it is unquestionably the best way to validate each model for its statistical performance in terms of the robustness and predictivity of partial sub-models on chemicals that have been iteratively put aside (held out) in the test sets. According to some authors,[2-14] including me,[10] this should be defined as the internal validation, because at the end of the complete modelling process the molecular structure of all the chemicals has been seen within the validation procedure, and their structural information has contributed to the molecular descriptor selection, at least in one run of CV, when they were iteratively put in the training sub-set. Therefore, they are not really external (completely new) to the final model. Indeed, internal validation parameters for proposed QSAR models must always be reported in publications to guarantee model robustness.

Moreover, in QSAR modelling, it is important to distinguish an approach proposing predicted data from a specific single model (easily reproduced by any user) from an approach that produces predicted data obtained by averaging the results from multiple models, and therefore by a more complex algorithm. In my research I always apply the first approach, while the work discussed by Gütlein et al. in their paper uses the second one. The reason to prefer a single model, which is a unique specific regression equation based on a few selected descriptors with their relative coefficients, is mainly related to the preference for an unambiguous algorithm (requested by the second of the well-known OECD Principles for the validation of QSAR models for applicability in regulation[19]): it is the simplest and most easily reproducible, and therefore easily applicable by a wide number of users, including regulators under REACH, the new European legislation on chemicals.

According to Principle 4, discussed in depth in my previous paper[10] and in the Guidance Document on the OECD Principles,[20] the model must be verified for its goodness of fit (by R²), robustness (by internal cross-validation: Q²LOO and Q²LMO) and external predictivity (on external-set compounds, which did not take part in the model development). Also in the Guidance Document there is a clear distinction between internal and external validation in this sense.
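As a concrete illustration of the statistics required by Principle 4 (a minimal sketch of mine, not material from the letter): the fragment below computes R², Q²LOO and Q²LMO for a multiple linear regression with fixed descriptors, assuming numpy and scikit-learn; the descriptor matrix and response are simulated stand-ins for real molecular descriptors and activities.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, ShuffleSplit

def q2_cv(X, y, splitter):
    """Q2 = 1 - PRESS/TSS, pooling the hold-out residuals of a CV scheme."""
    press, tss = 0.0, 0.0
    for train, test in splitter.split(X):
        model = LinearRegression().fit(X[train], y[train])  # refit on each training sub-set
        press += np.sum((y[test] - model.predict(X[test])) ** 2)
        tss += np.sum((y[test] - y[train].mean()) ** 2)
    return 1.0 - press / tss

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))   # 3 hypothetical molecular descriptors for 40 chemicals
y = X @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.3, size=40)

r2 = LinearRegression().fit(X, y).score(X, y)        # goodness of fit
q2_loo = q2_cv(X, y, LeaveOneOut())                  # robustness, leave-one-out
q2_lmo = q2_cv(X, y, ShuffleSplit(n_splits=100, test_size=0.3, random_state=0))  # leave-many-out
print(f"R2 = {r2:.3f}, Q2LOO = {q2_loo:.3f}, Q2LMO = {q2_lmo:.3f}")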
Only models with good internal validation parameters, which guarantee their robustness, should be chosen from among all the single models obtained by using the Genetic Algorithm (GA) as the method for descriptor selection in Ordinary Least Squares (OLS) regression (my QSAR approach, as implemented in my in-house software QSARINS).[18] However, my personal experience (and not only mine)[5,10] is that some QSAR models show good performance when verified by CV, but are unable to predict really unseen new chemicals (see as examples models 4-6 in Table 1, which are over-optimistically verified as predictive according to CV, while, by various statistical parameters used for external validation,[15,16] they demonstrate their inability to predict new independent chemicals when applied to two different prediction sets). There is evidence that CV is necessary, but it is not sufficient to guarantee predictivity for new external chemicals.[5,10,20] Therefore, it is not a "psychological argument" (as stated by Gütlein et al.), but rather the different philosophical approach of a QSAR modeller who wishes to propose only cross-validated single models that are additionally verified for their possible predictivity on truly never-seen chemicals, to guarantee a larger generalizability. Certainly we will never wish to propose models such as those that could be present in the GA population of CV-validated OLS models, as for instance some models in Tab. 1 of the "Principles of QSAR model validation" paper[10] and the externally unpredictive models no. 4-6 in Table 1.


In my cited paper, "Principles of QSAR models validation: internal and external",[10] I clarified the different aspects of what are, in my opinion and in the OECD Guidance Document for QSAR model validation,[20] internal and external validations, but additional clarification is needed and is provided here. The question that requires an answer, obtainable only by an additional external validation of one specific QSAR model, is: is the developed model, whose robustness has been validated by CV, able also to predict completely new chemicals? These external chemicals must never be included in training sets during the complete process of the model development, not even in one single iteration of the k-fold CV procedure; therefore, their structural information must NEVER be taken into account. Ideally, the best external set of new chemicals for this evaluation (the so-called "blind set") would be one that becomes available to a QSAR modeller after his model development; this set could also be called a "temporal" set. However, it is very rare to have a blind data set, due to limited data availability and time constraints (we would always have to wait for new experimental data, a "temporal" set, for QSAR model evaluation). Therefore, if the QSAR modeller wishes to verify a real model predictivity before proposing his best single model, his only option is to exploit the actual data availability, sacrificing part of these data in a preliminary step before the model development, putting aside these supposed "unknown" compounds for later use in the evaluation of the models, which are developed only on the remaining training set used in the learning process. In terms of validation procedure there is no difference between an external data set that is temporally delayed and an external set obtained by splitting an available data set. The chemicals put aside in this preliminary splitting step constitute the set that, preferably, should be called the "external prediction set(s)"[10,15-18] or the "external evaluation set",[14] to be clearly distinguished from the iterative test sets of CV.
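To make the sequence operational, here is a minimal sketch of the workflow just described (my illustration on simulated data; a trivial correlation filter stands in for the GA descriptor selection): the external prediction set is carved out before any descriptor selection, so its structures can never influence the model.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_all = rng.normal(size=(100, 20))  # hypothetical descriptor matrix, 20 candidate descriptors
y_all = X_all[:, :3] @ np.array([1.0, -0.7, 0.4]) + rng.normal(scale=0.3, size=100)

# 1) Preliminary split: the hold-out compounds play no further part in development.
X_tr, X_ext, y_tr, y_ext = train_test_split(X_all, y_all, test_size=0.25, random_state=1)

# 2) Descriptor selection (and any CV) uses the training set ONLY.
corr = np.abs([np.corrcoef(X_tr[:, j], y_tr)[0, 1] for j in range(X_tr.shape[1])])
selected = np.argsort(corr)[-3:]      # keep the 3 descriptors most correlated with y

# 3) One specific model, with fixed descriptors and fixed coefficients ...
model = LinearRegression().fit(X_tr[:, selected], y_tr)

# 4) ... is then evaluated, unchanged, on the truly external compounds.
press = np.sum((y_ext - model.predict(X_ext[:, selected])) ** 2)
q2_ext = 1.0 - press / np.sum((y_ext - y_tr.mean()) ** 2)   # a Q2F1-type statistic
print(f"Q2ext = {q2_ext:.3f}")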
Therefore, it should be clear that, in this approach, these two validations of single models have completely different aims, and cannot be used as parallel or alternative processes but only as sequential ones. The aim of CV is the preliminary validation of each single model in the GA population, and to help in the selection of the most robust and internally predictive models; the use of external prediction sets, instead, has the subsequent goal of evaluating each single model (these models being based only on the structural information found in the training set compounds and having passed the previous cross-validation) with regard to its predictivity on actually unseen compounds whose chemical structures, as already pointed out, have NEVER influenced the descriptor selection. The single model in the GA population of CV-validated robust models which, simulating a real application of the model, shows prediction performances (measured by Q²ext or CCC)[15,16] similar to the internal ones (measured by Q²LOO and Q²LMO) also on the prediction set compounds, is preferred as a verified externally predictive model and is our proposal (for instance models no. 1-3 in Table 1). To avoid misunderstandings on this point, it is probably useful, and better, to define this additional check on really external compounds as the "external evaluation" or "external verification" of a specific QSAR model, before its proposal.

In my recent works (see Gramatica et al.,[17] as an example), after having checked, at the end, that the molecular descriptors, selected in a robust specific model taking information only from the structures of the training chemicals of the two splittings, are successfully able also to predict completely new chemicals (the prediction sets), the same descriptors are used to re-develop a full model on the complete data set to exploit all the available information. This full model (which is based on the descriptors selected from the trainings, but has different coefficients) is normally our final QSAR model, which we propose.[17] This procedure is schematized in Figure 1 (modified from Figure 1 in Gramatica et al.[17]).

Figure 1. Scheme of the described sequential validation procedure (modified from Figure 1 in Gramatica et al.[17]).
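The re-development step just described can be sketched as follows (again my illustration, not the QSARINS implementation): the already validated descriptors are kept fixed, and only the coefficients are re-estimated on the complete data set.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
w = np.array([1.0, -0.7, 0.4])
X_tr = rng.normal(size=(60, 3))     # the descriptors already selected and validated
y_tr = X_tr @ w + rng.normal(scale=0.3, size=60)
X_pr = rng.normal(size=(20, 3))     # prediction set, already used for external evaluation
y_pr = X_pr @ w + rng.normal(scale=0.3, size=20)

validated = LinearRegression().fit(X_tr, y_tr)   # the externally verified model
full = LinearRegression().fit(np.vstack([X_tr, X_pr]), np.concatenate([y_tr, y_pr]))

# Same descriptors, different coefficients in the proposed "full model":
print("training-set coefficients:", np.round(validated.coef_, 3))
print("full-model coefficients:  ", np.round(full.coef_, 3))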

Table 1. Comparison of internal and external validation parameters for some algae toxicity models.

                                   Splitting by structure (Kohonen maps)       |  Splitting by ordered response
No.  Variables                     R²    Q²LOO  Q²LMO  Q²ext-Fn       CCCext   |  R²    Q²LOO  Q²LMO  Q²ext-Fn         CCCext
1    T(N..S) AEigZ Seigv*          0.83  0.76   0.72   0.72-0.84      0.87     |  0.85  0.79   0.77   0.73-0.79        0.86
2    AEigm F08[O-O] Seigv          0.84  0.80   0.76   0.72-0.84      0.84     |  0.87  0.83   0.81   0.69-0.76        0.82
3    nDB X2sol JGI4                0.80  0.70   0.66   0.80-0.88      0.87     |  0.83  0.76   0.74   0.70-0.77        0.86
4    Xindex F07[C-Cl] F08[O-O]     0.84  0.76   0.74   (-0.02)-0.40   0.62     |  0.83  0.75   0.73   0.10-0.32        0.62
5    nDB Xt F08[N-O]               0.84  0.77   0.75   (-0.13)-0.34   0.60     |  0.80  0.72   0.71   0.02-0.25        0.69
6    Xt nCONN nCXr                 0.83  0.79   0.77   (-0.43)-0.16   0.58     |  0.84  0.81   0.78   (-0.36)-(-0.04)  0.56

* Model published by Gramatica et al.[10] For CCC (Concordance Correlation Coefficient) see the literature.[15,16]
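For reference, the external validation statistics reported in Table 1 can be computed as below (formulas from the cited literature on the Q²ext functions and CCC;[15,16] the implementation is my own sketch); as I read the table, each Q²ext-Fn entry is the range spanned by the three functions Q²F1, Q²F2 and Q²F3.

import numpy as np

def q2_f1(y_ext, y_hat, y_tr):
    """Q2F1: deviations of the external observations taken from the training mean."""
    return 1 - np.sum((y_ext - y_hat) ** 2) / np.sum((y_ext - y_tr.mean()) ** 2)

def q2_f2(y_ext, y_hat):
    """Q2F2: as F1, but deviations taken from the external-set mean."""
    return 1 - np.sum((y_ext - y_hat) ** 2) / np.sum((y_ext - y_ext.mean()) ** 2)

def q2_f3(y_ext, y_hat, y_tr):
    """Q2F3: external mean squared error relative to the training variance."""
    return 1 - np.mean((y_ext - y_hat) ** 2) / np.mean((y_tr - y_tr.mean()) ** 2)

def ccc(y_ext, y_hat):
    """Lin's Concordance Correlation Coefficient between observed and predicted."""
    me, mh = y_ext.mean(), y_hat.mean()
    cov = np.mean((y_ext - me) * (y_hat - mh))
    return 2 * cov / (y_ext.var() + y_hat.var() + (me - mh) ** 2)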


I hope that it is now clearer that in my approach external validation is not set against CV, nor is it normally used as a substitute for CV. It is an additional evaluation simulating a temporal "supposed unknown" set, while the comparative exercise of Gütlein et al.[1] has a different purpose from that of the usual application of external validation of QSAR models (particularly single models).

Gütlein et al. recall that Hawkins, in his papers,[7,8] recommends a "true CV", without additional external validation, to avoid the loss of information. The true CV should be done by selecting the modelling features within each fold of the CV procedure; however, it is important to note that in this way several different parallel sub-models, also based on different descriptors, are obtained, one in each fold. The predictions of these parallel, Hawkins-type sub-models could be used, for instance, in a consensus approach.
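A minimal rendering of this Hawkins-type "true CV" (my sketch; a correlation filter again stands in for a real feature selection method) shows how each fold produces its own sub-model, possibly with different descriptors, whose pooled hold-out predictions estimate the prediction error.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 15))
y = X[:, :3] @ np.array([1.0, -0.7, 0.4]) + rng.normal(scale=0.3, size=60)

press, tss = 0.0, 0.0
for train, test in KFold(n_splits=5, shuffle=True, random_state=3).split(X):
    # descriptor selection sees ONLY this fold's training part
    corr = np.abs([np.corrcoef(X[train, j], y[train])[0, 1] for j in range(X.shape[1])])
    sel = np.argsort(corr)[-3:]                               # may differ from fold to fold
    sub = LinearRegression().fit(X[train][:, sel], y[train])  # one parallel sub-model
    press += np.sum((y[test] - sub.predict(X[test][:, sel])) ** 2)
    tss += np.sum((y[test] - y[train].mean()) ** 2)
print(f"pooled Q2 of the parallel sub-models: {1 - press / tss:.3f}")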
In my opinion, the interesting results of the Gütlein et al. paper demonstrate that double CV could give less variable and better results (the lowest average prediction error for the available and modelled data) compared to external validation verified on a single set, but only if these two kinds of validation are used as alternative validation methods. However, the better prediction performances of the double cross-validated models are, in my opinion, logically expected, simply because all the structural information of the complete data set has been seen, at the end, by the overall final algorithm of this ensemble of sub-models. In fact, the chemicals considered as "new" or "unseen" are new only in each iterated fold, when they are held out in the test sets, but not for the complete algorithm.

Some sentences on "Cross Validated" (http://stats.stackexchange.com/), which is a question-and-answer site for statisticians, data analysts, data miners and data visualization experts, confirm this point: "cross-validation is used to optimize a model, but no real validation is done"; "Cross-validation will lead to lower estimated errors, which makes sense because you are constantly reusing the data".

Moreover, I wish to clarify a point that seems not to be shared by Gütlein et al.:[1] "the models that are built and validated on the folds are different from the finally reported model",[3] exemplifying with an MLR model developed by GA-OLS, as in our QSARINS software.[18] I recall here that in every run of the CV k-folds, the chemicals that are put in the test set are predicted by the respective sub-model, which is developed on a subset of chemicals that is different in each iteration. Even if the selected molecular descriptors are the same in each sub-model in the iterative runs and in the finally reported model, the coefficients of the descriptors in the model equation are different, because in each iteration the training set is different. It is also important to highlight that the final model obtained from the training, with exactly its coefficients, must be applied to the external prediction set, while it is not correct to rebuild the model on these compounds; otherwise, a new model is obtained that simply fits the external data rather than predicting them.
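The point is easy to demonstrate on simulated data (illustrative sketch): the five sub-models below share the same descriptors but not the same coefficients, and the reported model is applied to new compounds with its coefficients frozen, never refit.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.3, size=50)

for i, (train, _) in enumerate(KFold(n_splits=5).split(X), start=1):
    coef = LinearRegression().fit(X[train], y[train]).coef_
    print(f"fold {i} coefficients: {np.round(coef, 3)}")   # same descriptors, different values

final = LinearRegression().fit(X, y)   # the finally reported model, with ITS coefficients
X_new = rng.normal(size=(10, 3))       # truly external compounds
y_hat = final.predict(X_new)           # predict only; never rebuild the model on X_new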
An additional crucial aspect that must be taken into consideration is that the composition of the prediction set(s) (as is also true for the training set) has a marked influence on the results. It is not reasonable to verify model predictivity on too few chemicals, as is sometimes done, because the results could be good just by chance. Certainly, small input data sets are the most problematic for performing reliable external evaluation, and in these cases CV is the only validation that makes sense. The nature of the chemicals in the prediction set is also important: they should have a feature range and distribution similar to the training data, because any QSAR model can be reliably applied (and therefore must also be verified) only on chemicals belonging to the same Applicability Domain (AD) as the training set.[5,10,17,20] If the external chemicals are outside the model AD, the verification of predictive performance would be carried out in an extrapolation zone of less reliable predictions.
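One common way to implement such an AD check, presumably close in spirit to what is meant here, is the leverage approach with the usual warning limit h* = 3(p+1)/n; the sketch below is my illustration, not code from the cited works.

import numpy as np

def leverages(X_train, X_query):
    """Leverage h of each query compound with respect to the training descriptor space."""
    Xt = np.column_stack([np.ones(len(X_train)), X_train])   # add intercept column
    Xq = np.column_stack([np.ones(len(X_query)), X_query])
    core = np.linalg.inv(Xt.T @ Xt)
    return np.einsum('ij,jk,ik->i', Xq, core, Xq)            # diag(Xq @ core @ Xq.T)

rng = np.random.default_rng(5)
X_tr = rng.normal(size=(40, 3))
X_ext = rng.normal(size=(10, 3))
h_star = 3 * (X_tr.shape[1] + 1) / len(X_tr)                 # warning leverage h*
inside = leverages(X_tr, X_ext) <= h_star
print(f"{inside.sum()}/{len(inside)} external compounds inside the AD")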


For this reason, it is important to apply some rules for splitting the original data into a training set (for the learning step and the subsequent random splits into training sub-sets and test sets for CV) and a prediction set (for external evaluation after model development and CV).[17,20] It is also useful to apply different kinds of splitting methods, as implemented in QSARINS.[18] To avoid the limitation of using only a single external set, in our recent papers[17] we always verify our models on two or three different prediction sets: one obtained on the sorted responses (for verifying the model on chemicals in the response domain), another on structural similarity by Kohonen maps (to check the model in the structural domain), and/or even a random one (the splitting that is, in a sense, most similar to the real-life situation of unknown new chemicals and that, being unbiased with respect to response and structure, cannot be accused of purposeful manipulation).
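As an illustration of the first of these splitting schemes (my rendering of "splitting by ordered response"; the exact QSARINS procedure may differ): the compounds are sorted by response and every k-th one is assigned to the prediction set, so that the external chemicals span the whole response domain.

import numpy as np

def ordered_response_split(y, k=4):
    """Return (training, prediction) index arrays: every k-th compound by sorted response."""
    order = np.argsort(y)
    pred_idx = order[::k]                        # spread across the response range
    train_idx = np.setdiff1d(np.arange(len(y)), pred_idx)
    return train_idx, pred_idx

y = np.random.default_rng(6).normal(size=20)     # hypothetical responses
train_idx, pred_idx = ordered_response_split(y)
print(len(train_idx), "training /", len(pred_idx), "prediction compounds")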
In conclusion, the two modelling approaches compared here are philosophically different, and neither should be considered right or wrong. The approach based on double CV is focused only on obtaining the best statistical performance (the lowest prediction error). The information from all the available data is exploited, therefore the best results are expected. This approach produces an ensemble of different models, each verified on test chemicals that can be considered external only for the corresponding sub-model; at the end, however, the complete algorithm is not verified on really new chemicals. Therefore, I agree with this statement in the Gütlein et al. paper:[1] "If external validation implies (i) that no instance from any test set is ever used for building the final model (see e.g. the literature[3,6,20]), then no form of cross-validation (in which the complete data set is repeatedly divided into disjoint training and test sets) can be regarded as external validation".

The approach based on the additional verification of statistically robust models on really external chemicals has the aim of proposing good QSAR models (even if, probably, not the best possible models), additionally evaluating them for external predictivity before their presentation, in order to guarantee a larger generalizability. We hope to avoid the proposal of models which are predictive only in appearance, such as models no. 4-6 in Table 1. In my opinion, it is important to remember that the ultimate goal of a validation strategy should be to simulate, with sufficient accuracy, the difficulties that one would encounter when applying a methodology in future circumstances (new experimental data), trying to represent the future working situation of the particular model: only an additional external evaluation on totally new chemicals can do this for QSAR models, after their internal validation by CV, and this should be done already at the proposal step.

Acknowledgements

I thank Knut Baumann, and also my collaborators Nicola Chirico and Stefano Cassani, for the interesting and helpful discussions during the preparation and revision of this letter.

References

[1] M. Gütlein, C. Helma, A. Karwath, S. Kramer, Mol. Inf. 2013, 32, 516-528.
[2] H. Kubinyi, F. A. Hamprecht, T. Mietzner, J. Med. Chem. 1998, 41, 2553-2564.
[3] H. Kubinyi, Quant. Struct.-Act. Relat. 2002, 21, 348-356.
[4] A. Golbraikh, A. Tropsha, J. Mol. Graph. Model. 2002, 20, 269-276.
[5] A. Tropsha, P. Gramatica, V. K. Gombar, QSAR Comb. Sci. 2003, 22, 69-77.
[6] K. Baumann, TrAC, Trends Anal. Chem. 2003, 22, 395-406.
[7] D. M. Hawkins, S. C. Basak, D. Mills, J. Chem. Inf. Comput. Sci. 2003, 43, 579-586.
[8] D. M. Hawkins, J. Chem. Inf. Comput. Sci. 2004, 44, 1-12.
[9] K. Baumann, N. Stiefl, J. Comput.-Aided Mol. Des. 2004, 18, 549-562.
[10] P. Gramatica, QSAR Comb. Sci. 2007, 26, 694-701.
[11] A. Tropsha, A. Golbraikh, Curr. Pharm. Des. 2007, 13, 3494-3504.
[12] A. Tropsha, Mol. Inf. 2010, 29, 476-488.
[13] K. H. Esbensen, P. Geladi, J. Chemom. 2010, 24, 168-187.
[14] T. M. Martin, P. Harten, D. M. Young, E. N. Muratov, A. Golbraikh, H. Zhu, A. Tropsha, J. Chem. Inf. Model. 2012, 52, 2570-2578.
[15] N. Chirico, P. Gramatica, J. Chem. Inf. Model. 2011, 51, 2320-2335.
[16] N. Chirico, P. Gramatica, J. Chem. Inf. Model. 2012, 52, 2044-2058.
[17] P. Gramatica, S. Cassani, P. P. Roy, S. Kovarich, C. W. Yap, E. Papa, Mol. Inf. 2012, 31, 817-835.
[18] P. Gramatica, N. Chirico, E. Papa, S. Cassani, S. Kovarich, J. Comput. Chem. 2013, 34, 2121-2132.
[19] OECD Principles, 2004; http://www.oecd.org/dataoecd/33/37/37849783.pdf (accessed 02/02/2014).
[20] Guidance Document on the Validation of (Quantitative) Structure-Activity Relationships Models, ENV/JM/MONO(2007)2; http://search.oecd.org/officialdocuments/displaydocumentpdf/?doclanguage=en&cote=env/jm/mono%282007%292 (accessed 02/02/2014).

