
4.05 Chemometrics in QSAR
R. Todeschini and V. Consonni, University of Milano-Bicocca, Milan, Italy
P. Gramatica, Insubria University, Varese, Italy
© 2009 Elsevier B.V. All rights reserved.
4.05.1 Introduction
4.05.2 Short History of QSAR and Molecular Descriptors
4.05.3 Chemometrics and QSAR Modeling
4.05.4 Specific QSAR Approaches
4.05.4.1 Hansch Approach
4.05.4.2 Free-Wilson Approach
4.05.4.3 LSER Approach
4.05.4.4 Group Contribution Methods
4.05.4.5 Cluster Significance Analysis
4.05.4.6 Read-Across Approach
4.05.5 Molecular Descriptors
4.05.5.1 Molecular Structure Representations
4.05.5.2 0D Descriptors or Count Descriptors
4.05.5.3 1D Descriptors or Fingerprints
4.05.5.4 2D Descriptors or Topological Descriptors
4.05.5.5 3D Descriptors or Geometrical Descriptors
4.05.5.6 4D Descriptors or Grid-Based Descriptors
4.05.6 Molecular Descriptor Selection
4.05.6.1 Variable Reduction
4.05.6.2 Variable Subset Selection
4.05.6.3 Consensus Modeling
4.05.7 Principles for QSAR Modeling
4.05.7.1 Unambiguous Model Algorithm
4.05.7.2 Applicability Domain
4.05.7.3 Validation
4.05.7.4 Model Descriptor Interpretability
4.05.7.5 Summaries of QSAR Models
4.05.8 Conclusions
References
4.05.1 Introduction
The discovery of relationships among different concepts, in particular concepts provided by different scientific
fields, represents the most important way to develop new scientific knowledge and transform isolated
information into a deeper theoretical knowledge.
The concepts of molecular structure, its representation by theoretical molecular descriptors, and its relation-
ship with experimental properties of molecules are an interdisciplinary network, where a lot of theories,
knowledge, and methodologies and their interrelationships are present, leading to a new scientific research field
with a relevant follow-up in several practical applications.
Molecular descriptors are numerical indices encoding some information related to the molecular structure.
They can be both experimental physicochemical properties of molecules and theoretical indices calculated by
mathematical formulas or computational algorithms.
Molecular descriptors, tightly connected to the molecular structure, play a fundamental role in scientific
research, being the theoretical core of a complex network of knowledge, as is shown in Figure 1. Indeed,
molecular descriptors are based on several different theories, such as quantum chemistry, information theory,
organic chemistry, and graph theory, and are used to model several different properties of chemicals in
scientific fields such as toxicology, analytical chemistry, physical chemistry, medicinal, pharmaceutical, and
environmental chemistry.
Moreover, to obtain reliable estimates of molecular properties, data elucidation, and data mining, molecular
descriptors are processed by several methods provided by statistics, chemometrics, and chemoinformatics. In
particular, chemometrics has for about 30 years been developing classification and regression methods able to
provide, although not always, reliable models, both for reproducing the known experimental data and for
predicting the unknown data. The modeling process usually has not only explanatory but also predictive
purposes. Interest in predictive models able to give effective, reliable estimates has grown considerably
in the last few years, as they are increasingly considered useful and safer tools for predicting data on
chemicals.
Quantitative structure-activity relationships (QSARs) are the final result of the process that starts with a
suitable description of molecular structures and ends with some inference, hypothesis, and prediction on the
behavior of molecules in environmental, biological, and physicochemical systems in analysis (Figure 2).
QSARs are based on the assumption that the structure of a molecule (e.g., its geometric, steric, and electronic
properties) must contain the features responsible for its physical, chemical, and biological properties and on the
ability to capture these features into one or more numerical descriptors. Using QSAR models, the biological
activity (or property, reactivity, etc.) of a newly designed or untested chemical can be inferred from the
molecular structure of similar compounds whose activities (properties, reactivities, etc.) have already been
assessed.
Besides the well-known approach called QSARs, other specific approaches aimed at relating the molecular
structure to some experimental (or calculated) properties are quantitative structure-reactivity relationships
(QSRRs), quantitative shape-activity relationships (QShARs), the molecular shape being considered as a
component of the molecular structure, quantitative structure-chromatographic relationships (QSCRs),
quantitative structure-toxicity relationships (QSTRs), quantitative structure-biodegradability relationships
(QSBRs), quantitative similarity-activity relationships (QSiARs), quantitative structure-enantioselective
retention relationships (QSERRs), and so on. Generally speaking, the quantitative structure-property
relationship (QSPR) acronym is used when any property different from biological activity is modeled.
[Figure 1: molecular descriptors are derived from graph theory, discrete mathematics, physical chemistry,
information theory, quantum chemistry, organic chemistry, differential topology, and algebraic topology;
processed by statistics, chemometrics, and chemoinformatics; and applied in QSAR/QSPR, medicinal chemistry,
pharmacology, genomics, drug design, toxicology, proteomics, analytical chemistry, environmetrics, virtual
screening, and library searching.]
Figure 1 General scheme of the relationships among molecular structure, molecular descriptors, chemometrics, and
QSAR/QSPR.
Despite the differences among the approaches defined above, in the literature the most common terms
referring to all these approaches are QSAR and QSPR, with a unique simple distinction between activity and
property. These will also be the main terms used in this chapter, without any further distinctions.
It has been nearly 45 years since QSAR modeling was first introduced into the practice of agrochemistry,
drug design, toxicology, and industrial and environmental chemistry. Its growing power in the following years
may be mainly attributed to the rapid and extensive development of methodologies and computational
techniques that have made it possible to delineate and refine the many variables and approaches used to model
molecular properties.[1–11] Furthermore, interest in QSAR keeps growing because these tools are now used
not only for research purposes but also to produce data on chemicals in the interest of time and cost
effectiveness.
Chemometrics is largely applied in QSAR research, both from a methodological and from a technical point
of view. Indeed, it provides tools and ideas to describe molecular structures and model their properties with a
continuous attention to the basic chemometric philosophy, based on model validation, information synthesis by
new indices, and graphical representation of data information.
4.05.2 Short History of QSAR and Molecular Descriptors
The history of QSAR and molecular descriptors is closely related to the history of what can be considered one
of the most important scientific concepts of the last part of the nineteenth century and the whole of the
twentieth century, that is, the concept of molecular structure.
The years between 1860 and 1880 were characterized by a strong dispute about the concept of molecular
structure, arising from the studies on substances showing optical isomerism and the studies of Kekulé (1861–67)
on the structure of benzene. The concept of the molecule thought of as a three-dimensional (3D) body was first
proposed by Butlerov (1861–65), Wislicenus (1869–73), van't Hoff (1874–75), and Le Bel (1874). The
publication in French of the revised edition of La chimie dans l'espace by van't Hoff in 1875 is considered a
milestone of the 3D conception of chemical structures.
QSAR history started a century earlier than the history of molecular descriptors, being closely related to the
development of the molecular structure theories. QSAR modeling was born in the field of toxicology. Attempts to
quantify relationships between chemical structure and acute toxic potency have been part of the toxicological
literature for more than 100 years. In the defense of his thesis entitled Action de l'alcohol amylique sur
l'organisme at the Faculty of Medicine, University of Strasbourg, France, on 9 January 1863, Cros noted that a
relationship existed between the toxicity of primary aliphatic alcohols and their water solubility. This
relationship demonstrated the central axiom of structure-toxicity modeling, that is, the toxicity of substances is
governed by their properties, which are determined in turn by their chemical structure. Therefore, there are
interrelationships among structure, properties, and toxicity.
[Figure 2: scheme linking molecules, via experiments, to physicochemical properties and biological activities
and, via theory, to molecular descriptors, with QSPR and QSAR as the relationships between the descriptors and
the two kinds of experimental response.]
Figure 2 General scheme of the QSAR/QSPR philosophy.
Crum-Brown and Fraser (1868–69)[12–14] proposed the existence of a correlation between the biological activity
of different alkaloids and their molecular constitution. More specifically, the physiological action of a substance
in a certain biological system (Φ) was defined as a function (f) of its chemical constitution (C):
$$\Phi = f(C) \tag{1}$$
Thus, an alteration in chemical constitution, ΔC, would be reflected by an effect on biological activity, ΔΦ.
This equation can be considered the first general formulation of a QSAR.
A few years later, a hypothesis on the existence of correlations between molecular structure and
physicochemical properties was reported in the work of Körner (1874),[15] which dealt with the synthesis
of disubstituted benzenes and the discovery of ortho, meta, and para derivatives. The different colors of
disubstituted benzenes were thought to be related to their differences in molecular structure. Ten years
later, Mills (1884)[16] published a study 'On Melting Point and Boiling Point as Related to Composition' in
the Philosophical Magazine.
The quantitative property-activity models commonly regarded as marking the beginning of systematic QSAR/QSPR
studies[17] came out of the search for relationships between the potency of local anesthetics and
the oil/water partition coefficient,[18] between narcosis and chain length,[19] and between narcosis and
surface tension.[20] In particular, the concepts developed by Meyer and Overton are often referred to as the
Meyer-Overton theory of narcotic action.[18,19]
The first theoretical QSAR/QSPR approaches date back to the end of the 1940s and are those that relate
biological activities and physicochemical properties to theoretical numerical indices derived from the
molecular structure.
The Wiener index[21] and the Platt number,[22] proposed in 1947 to model the boiling point of hydrocarbons,
were the first theoretical molecular descriptors based on graph theory.
In the early 1960s, new molecular descriptors were proposed, giving the start to systematic studies on
molecular descriptors, mainly based on graph theory.[23–32]
The use of quantum-chemical descriptors in QSAR studies dates back to the early 1970s,[29] although
quantum-chemical descriptors were defined and used long before in the framework of quantum chemistry. During
1930–60, the milestones were the works of Pauling[33,34] and Coulson[35] on the chemical bond, Sanderson[36] on
electronegativity, and Fukui et al.[37] and Mulliken[38] on electronic distribution.
Once the concept of molecular structure was definitively consolidated by the successes of quantum
chemistry theories and the approaches to the calculation of numerical indices encoding molecular structure
information were accepted, all the constitutive elements for the take-off of QSAR strategies were available.
The seminal work of Hammett, based on the Hammett equation,[39,40] gave rise to the σ-ρ culture in the
delineation of substituent effects on organic reactions, whose aim was the search for linear free energy
relationships (LFERs):[41] steric, electronic, and hydrophobic constants were defined, becoming a basic tool
for modeling properties of molecules.
In the 1950s, the fundamental works of Taft[42–44] in physical organic chemistry were the foundation of
relationships between physicochemical properties and solute-solvent interaction energies (linear solvation
energy relationships, LSERs), based on steric, polar, and resonance parameters for substituent groups in
congeneric compounds.
In the mid-1960s, led by the pioneering works of Hansch,[23,45,46] the QSAR/QSPR approach began to assume
its modern look.
In 1962, Hansch et al.[45] published their study on the structure-activity relationships of plant growth
regulators and their dependency on Hammett constants and hydrophobicity. Using the octanol/water system,
a whole series of partition coefficients were measured, and thus a new hydrophobic scale was introduced for
describing the aptitude of molecules to move through environments characterized by different degrees of
hydrophilicity, such as blood and cellular membranes. The delineation of Hansch models led to explosive
development in QSAR analysis and related approaches.[3]
In the same years, Free and Wilson[47] developed a model of additive substituent contributions to biological
activities, giving a further push to the development of QSAR strategies.
They proposed to model a biological response on the basis of the presence/absence of substituent groups on
a common molecular skeleton.[47,48] This approach, called the de novo approach when presented in 1964, was
based on the assumption that each substituent gives an additive and constant effect to the biological activity
regardless of the other substituents in the rest of the molecule.
At the end of the 1960s, many structure-property relationships were proposed based not only on substituent
effects but also on indices describing the whole molecular structure. These theoretical indices were derived
from a topological representation of the molecule, mainly applying graph theory concepts, and are therefore
usually referred to as 2D descriptors.
The fundamental works of Balaban,[49,50] Randić,[51,52] and Kier et al.[53] led to further significant
developments of the QSAR approaches based on topological indices (TIs).
As a natural extension of the topological representation of a molecule, the geometrical aspects of a molecule
have been taken into account since the mid-1980s, leading to the development of 3D-QSAR, which exploits
information on the molecular geometry. Geometrical descriptors were derived from the 3D spatial coordinates
of a molecule, and among them there were shadow indices,[54] charged partial surface area descriptors,[55]
weighted holistic invariant molecular (WHIM) descriptors,[56] gravitational indices,[57] EigenVAlue (EVA)
descriptors,[58] 3D-MoRSE descriptors,[59] EEVA descriptors,[60] and GEometry, Topology, and Atom-Weights
AssemblY (GETAWAY) descriptors.[61]
In the late 1980s, a new strategy for describing molecule characteristics was proposed, based on molecular
interaction fields (MIFs), which are composed of interaction energies between a molecule and probes at
specified spatial points in 3D space. Different probes (such as a water molecule, methyl group, and hydrogen)
were used for evaluating the interaction energies in thousands of grid points where the molecule was
embedded. As the final result of this approach, a scalar field (a lattice) of interaction energy values
characterizing the molecule was obtained. The first formulation of a lattice model to compare molecules by
aligning them in 3D space and extracting chemical information from MIFs was proposed by Goodford[62] in the
GRID method and then by Cramer et al.[63] in the comparative molecular field analysis (CoMFA).
Still based on MIFs, several other methods were subsequently proposed, and among them there were
comparative molecular similarity indices analysis (CoMSIA),[64] the Compass method,[65] G-WHIM descriptors,[66]
Voronoi field analysis,[67] the VolSurf approach,[68] and GRIND descriptors.[69]
Finally, in recent years the scientific community has shown increasing interest in combinatorial
chemistry, high-throughput screening, substructural analysis, and similarity searching, for which several
similarity/diversity approaches have been proposed, mainly based on substructure descriptors such as
molecular fingerprints.[10,11,70]
4.05.3 Chemometrics and QSAR Modeling
The development of QSAR/QSPR models is quite a complex process, as outlined in Figure 3. Once the
research goal has been clearly defined, which in most cases means defining the property to be modeled, that is,
the endpoint, the next decision concerns how general the final model should be. This entails the
selection of the set of molecules to which the modeling procedure is applied. For a long time, QSAR models were
developed on sets of congeneric compounds, that is, molecules with a common parent structure and different
substituent groups. Later, the interest in producing tools for quick molecular property estimation moved
toward more general QSAR models suitable for diverse molecules belonging to different chemical classes, that
is, noncongeneric sets. The final decision in defining the molecule set mainly depends on the foreseen use of the
model and the availability of experimental data.
In this phase of the QSAR process, it is of primary concern to gain exhaustive knowledge about the
compounds under analysis with specific regard to the endpoint of interest. This obviously implies the acquisition
of reliable experimental data regarding the endpoint and, possibly, of already existing models. Data on the
chemicals can be produced experimentally or retrieved from the literature. In both cases, accuracy should be
carefully evaluated: the limiting factor in the development of QSAR/QSPR models is the availability of
high-quality experimental data, because the accuracy of the property estimated by a model cannot exceed the
accuracy of the input data. Moreover, when data are collected from the literature, they should be taken from a
single source or from closely comparable sources, to avoid introducing additional variability due to different
sources of information.
Another important phase of the QSAR process is the definition of a reliable chemical space, in other words,
the selection of those structural features thought to be most responsible for the endpoint under
analysis. This implies the selection of proper molecular descriptors but, in most cases, there is no a priori
knowledge about which molecular descriptors are the best. The tendency is therefore to use a huge number of
descriptors, which hopefully include the candidate variables for modeling, and later apply a variable selection
technique. Two basic strategies can be adopted: (1) the use of algorithms to select the optimal subset(s) of
descriptors and (2) the use of chemometric methods (e.g., principal component analysis (PCA) or partial least
squares (PLS)) able to condense the large amount of available chemical information into a few principal
variables.
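Strategy (2) can be illustrated with a minimal sketch of PCA on an autoscaled descriptor matrix. The function name `pca_scores`, the synthetic data, and the choice of autoscaling are assumptions made for illustration only; they are not taken from the chapter.

```python
import numpy as np

def pca_scores(X, n_components):
    """Condense a descriptor matrix into a few principal-component scores."""
    X = np.asarray(X, dtype=float)
    # Autoscale each descriptor (column) to zero mean and unit variance.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the scaled matrix; the rows of Vt are the loading vectors.
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]

# Hypothetical data: 20 molecules, 50 correlated descriptors
# generated from 2 hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(20, 2))
X = latent @ rng.normal(size=(2, 50)) + 0.05 * rng.normal(size=(20, 50))

scores, explained = pca_scores(X, n_components=2)
print(scores.shape, explained.sum())
```

In this synthetic case two score columns carry almost all the variance of the 50 descriptors, which is the kind of condensation the text refers to.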
The next step is the selection of the validation procedure, which, in addition to the fitting performance
of the model, allows the evaluation of the model prediction ability. The latter is usually considered the
most important characteristic for an acceptable QSAR model. The predictive ability of the models is
evaluated by dividing the compounds into the training set, that is, the set by which the model is calculated,
and the test set, that is, the set of compounds by which the model predictive ability is evaluated. The
partition into training/test sets is performed in different ways, depending on the validation procedure (see
Section 4.05.7.3).
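A minimal sketch of the training/test partition follows, assuming a random hold-out split and an ordinary least squares model. The helper names, the 25% test fraction, and the particular external Q² formula (test-set prediction error relative to the deviation from the training-set mean) are illustrative assumptions; the chapter discusses the actual validation procedures in Section 4.05.7.3.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def holdout_q2(X, y, test_fraction=0.25, seed=0):
    """Fit on the training set, then score predictions on the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = max(1, int(round(test_fraction * len(y))))
    test, train = idx[:n_test], idx[n_test:]
    coef = fit_ols(X[train], y[train])
    press = np.sum((y[test] - predict(coef, X[test])) ** 2)
    # Assumed reference: deviation of test responses from the training mean.
    tss = np.sum((y[test] - y[train].mean()) ** 2)
    return 1.0 - press / tss, train, test

# Hypothetical data: 40 compounds, 3 descriptors, nearly linear response.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=40)

q2, train, test = holdout_q2(X, y)
print(len(train), len(test), q2)
```

The test compounds play no part in estimating the coefficients, so the resulting statistic reflects prediction rather than fitting.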
Exploratory data analysis is a common preliminary step in all QSAR/QSPR studies. In particular, PCA
and clustering methods (both hierarchical and nonhierarchical) are the most commonly used. In recent years,
the clustering approach based on Kohonen maps (or self-organizing maps, SOMs) has been gaining wide
importance; it is an artificial 2D neural network providing easily interpretable information
about similarity/diversity among objects.[71,72]
By exploratory analysis, the QSAR expert can evaluate whether the chosen molecular descriptors are
suitable for describing the compounds under analysis and whether the chemical space is sufficiently represented.
Moreover, the tendency observed nowadays is to build a reference chemical space for large categories of
chemicals for which molecular properties are known by using methods such as PCA on molecular fingerprints.
Then, this chemical space is used to analyze similarities among groups of chemicals showing, for example,
groups of different biological activity, and to find which regions in the chemical space need to be
explored further by designing new molecular structures.
[Figure 3: a set of molecules is split into a training set and a test set, each described by molecular
descriptors and experimental responses; the model (QSAR, QSPR, ...) is fitted on the training set, its
prediction power is assessed on the test set, and it is then applied to the molecular descriptors of new
molecules to give predicted new responses, or used in the opposite direction for reversible decoding
(inverse QSAR).]
Figure 3 General scheme of the QSAR/QSPR strategy.
The majority of the QSAR strategies aimed at building models are based on regression and classification
methods, depending on the studied problem. For continuous properties, the typical QSAR/QSPR model is
defined as
$$P = f(x_1, x_2, \ldots, x_p) \tag{2}$$
where P is the molecular property/activity, $x_1, \ldots, x_p$ are the p molecular descriptors, and f is a function
representing the relationship between response and descriptors. In most cases, the function f is not
known a priori and needs to be estimated.
Ordinary least squares (OLS) regression, also called multiple linear regression (MLR), is the most common
regression technique used to estimate the quantitative relationship between molecular descriptors and the
property. PLS regression is widely applied especially when there are a large number of molecular descriptors
with respect to the number of training compounds, as it happens for 4D-QSAR methods such as GRID and
CoMFA.
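As a minimal sketch of an OLS/MLR fit, the function below estimates a linear f and reports the usual fitting statistics (n, R², standard error s, and F). The function name `mlr_fit`, the synthetic data, and the textbook formulas for s and F are assumptions for illustration and are not taken from the chapter.

```python
import numpy as np

def mlr_fit(X, y):
    """Fit y = b0 + b1*x1 + ... + bp*xp by least squares; return coefficients and statistics."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # add the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    rss = float(resid @ resid)                    # residual sum of squares
    tss = float(np.sum((y - y.mean()) ** 2))      # total sum of squares
    r2 = 1.0 - rss / tss
    s = np.sqrt(rss / (n - p - 1))                # standard error of the estimate
    f = ((tss - rss) / p) / (rss / (n - p - 1))   # Fisher F ratio
    return coef, r2, s, f

# Hypothetical data: 30 compounds, 2 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=30)

coef, r2, s, f = mlr_fit(X, y)
print(coef, r2, s, f)
```

Note that R² here measures fitting only; it says nothing yet about how the model behaves on compounds outside the training set.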
Several other methods play a fundamental role in QSARs such as principal component regression (PCR),
k-nearest neighbor (k-NN) regression, and stepwise regression (SWR), the last being the most applied method
to select a descriptor subset from a not too large set of candidate variables.
Regression techniques based on artificial neural networks (ANNs), such as back-propagation (BP) and
radial basis function (RBF) networks, are also frequently used,[73] as are ensemble approaches such as random
forests.[74,75]
For discrete molecular properties, such as properties defining active/inactive compounds, the typical
classification model is defined as
$$C = f(x_1, x_2, \ldots, x_p) \tag{3}$$
where C is the class to which each object is assigned by the obtained model, $x_1, \ldots, x_p$
are the p molecular descriptors, and f is a function representing the relationship between class assignment and
descriptors. Note that classification models are also quantitative models, only the response C being a
qualitative quantity.
Besides the classical discriminant analysis (DA) and the k-NN methods, other classification methods widely
used in QSAR/QSPR studies are SIMCA, learning vector quantization (LVQ), PLS-DA, classification and
regression trees (CARTs), and cluster significance analysis (CSA), specifically proposed for asymmetric
classification in QSARs. Other promising classification techniques have been added to the data analysis toolbox
in QSAR discovery, such as support vector machine (SVM),[76] embedded cluster modeling (ECM),[77] and
Classification and Influence Matrix Analysis (CAIMAN).[78]
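The classification model of Equation (3) can be sketched with the k-NN method mentioned above. The function name, the Euclidean distance, the two-descriptor toy data, and the class labels are hypothetical choices for illustration; in practice the descriptors would normally be scaled first.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, c_train, x_new, k=3):
    """Assign x_new to the majority class among its k nearest training compounds."""
    distances = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(c_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: two descriptors, two classes.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
c_train = ["inactive", "inactive", "inactive", "active", "active", "active"]

print(knn_classify(X_train, c_train, [0.95, 1.0]))   # active
print(knn_classify(X_train, c_train, [0.05, 0.1]))   # inactive
```

Here f is not an explicit equation: the class of a new compound is decided directly by the classes of its closest neighbors in descriptor space.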
In the last few years, ranking methods have also been introduced in structure-response correlation studies,
focusing on ranking the chemicals instead of reproducing some quantitative property. They are mainly
used to build priority lists of chemicals;[79–81] however, they have also been proposed for modeling
purposes.[82,83]
Ranking methods are simply aimed at giving a rank to the studied objects, that is, these methods are able to
provide a global index allowing the ranking of the samples (total ordering methods).
Ranking methods based, for example, on desirability, utility, and dominance functions allow a total
ordering of the chemicals to be reached by evaluating more than one descriptor simultaneously. Moreover, by
adding the relationship of incomparability among compounds to the total ordering, a partial ordering can be
obtained, resulting in the so-called Hasse diagram,[81] as shown in Figure 4.
The five objects (a, b, c, d, and e) are chemicals ranked on the basis of their toxicity, measured on different
organisms. The obtained Hasse diagram shows the hazard levels for these compounds. In particular, chemicals
a, b, and e (maximal elements, Level 3) are more dangerous than chemicals c and d, but nothing is known
about which of the three is absolutely the most dangerous, because they are incomparable. Moreover, chemical
e (isolated element) is not comparable with any of the other chemicals and thus, conventionally, is placed at the
highest level; two total orderings are obtained, namely (a, c, d) and (b, c, d), meaning, for example for the former,
that chemical a is more dangerous than chemical c which, in turn, is more dangerous than chemical d
(minimal element).
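The partial ordering just described can be reproduced with a short sketch. The toxicity scores on two organisms are hypothetical values chosen only so that the resulting order matches the example (a and b above c, c above d, e incomparable with everything); the function names are likewise illustrative.

```python
def dominates(u, v):
    """True if u is at least as toxic as v on every criterion and differs from v."""
    return all(a >= b for a, b in zip(u, v)) and u != v

def hasse_summary(scores):
    names = list(scores)
    above = {n: [m for m in names if dominates(scores[m], scores[n])] for n in names}
    below = {n: [m for m in names if dominates(scores[n], scores[m])] for n in names}
    maximal = [n for n in names if not above[n]]
    minimal = [n for n in names if not below[n]]
    isolated = [n for n in names if not above[n] and not below[n]]
    # Hasse edges: keep u > v only if no w lies strictly between them.
    edges = [(u, v) for u in names for v in below[u]
             if not any(w in below[u] and v in below[w] for w in names)]
    return maximal, minimal, isolated, edges

# Hypothetical toxicity of each chemical on two organisms (higher = more toxic).
scores = {"a": (5, 3), "b": (3, 5), "c": (2, 2), "d": (1, 1), "e": (0, 6)}

maximal, minimal, isolated, edges = hasse_summary(scores)
print(maximal)   # ['a', 'b', 'e']
print(minimal)   # ['d', 'e']
print(isolated)  # ['e']
print(edges)     # [('a', 'c'), ('b', 'c'), ('c', 'd')]
```

Formally the isolated element e is both maximal and minimal; placing it at the highest level, as in the text, is a drawing convention. The edges give the two chains a, c, d and b, c, d.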
Besides the simple ranking lists, totally or partially ordered, ranking models can also be used as a further tool
together with regression and classification models. In effect, a ranking model is defined as a relationship
between one or more dependent attributes (y), investigated experimentally and usually called criteria, and a set
of theoretically defined independent attributes (x), called model attributes, which are usually theoretical
calculated variables such as molecular descriptors.[82,83] This kind of model can be defined as
$$R_i(y_{i1}, y_{i2}, \ldots, y_{iK}) = f(x_{i1}, x_{i2}, \ldots, x_{ip}), \qquad 1 \le R_i \le n \tag{4}$$
where $R_i$ is the rank of the ith object, f is a ranking function, K the number of dependent attributes, and p the
number of independent attributes. In other words, the ranking of the ith object obtained from the experimental
attributes y is reproduced by a set of independent model attributes. Then, the rank of a new chemical with
respect to the training set chemicals can be evaluated by describing it with the model descriptors.
Once the model is calculated and properly validated, it can be used to estimate property values for new
molecules or obtain information about mechanism of action of a group of compounds or, in general, about
which structural features are responsible for a specific behavior of molecules. In the first case, the attention is
paid more to obtaining models with the highest predictive ability, regardless of the model interpretability.
Indeed, when the aim is to produce data on chemicals, the very important aspect is that the model is as reliable
as possible and not the reason why some molecular descriptors were selected in the model.
However, even when the predictive ability of a model is high, the estimated property should be treated
with care, because a molecule might be far from the model chemical space, in which case the response would be
the result of a strong extrapolation and the prediction would be unreliable. To cope with this problem, the
concept of the applicability domain (AD) of a model has emerged as a relevant aspect for the evaluation of
the prediction reliability.
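One common way to flag such extrapolations is a leverage check, sketched below. This particular technique, the warning threshold h* = 3(p + 1)/n, and all names and data are assumptions chosen for illustration; the chapter treats the applicability domain in Section 4.05.7.2 and may define it differently.

```python
import numpy as np

def leverages(X_train, X_new):
    """Leverage of each new compound with respect to the training descriptor space."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    B = np.column_stack([np.ones(len(X_new)), X_new])
    core = np.linalg.pinv(A.T @ A)
    # h_i = b_i^T (A^T A)^-1 b_i for every row b_i of B.
    return np.einsum("ij,jk,ik->i", B, core, B)

def outside_domain(X_train, X_new):
    n, p = X_train.shape
    h_star = 3 * (p + 1) / n          # assumed warning threshold
    return leverages(X_train, X_new) > h_star

# Hypothetical training space and two query compounds.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 2))
X_new = np.array([[0.0, 0.0],      # near the center of the training data
                  [10.0, 10.0]])   # far away: a strong extrapolation

print(outside_domain(X_train, X_new))
```

A prediction for the second compound would be flagged as an extrapolation regardless of how good the model statistics are.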
For some applications, the primary concern is the possibility of obtaining information about molecular structure
from QSAR/QSPR models. Any procedure capable of reconstructing the molecular structure or fragment starting from
molecular descriptor values is called reversible decoding (or inverse QSAR), that is, once molecular descriptors
are obtained from a structure representation, reversibility would lead from descriptors back to
structures.[81,84,85]
Reversible decoding is of great importance because, once a QSAR model is established, optimal values of the
molecular property can be chosen and values of the model molecular descriptors calculated by using
the estimated QSAR model; if reversibility is feasible, then molecular descriptors lead to structures. The
possible molecular structures corresponding to the optimized descriptor values can be designed (and
synthesized). Unfortunately, this last operation is often a troublesome task when the model molecular
descriptors are not simple or easily interpretable.
Reversibility is a highly desired property of a descriptor, but is not strictly essential for structure-response
studies. In effect, if the QSAR model needs to be used for producing reliable data on chemicals, reversibility is
not a necessary requirement. On the contrary, if the model is used for drug design, the requirement of
reversibility needs to be fulfilled.
[Figure 4: Hasse diagram of five chemicals ordered by hazard. Level 3: a, b, e; Level 2: c; Level 1: d.
a, b, and e are maximal elements and incomparable alternatives; d is the minimal element; the chains are
b, c, d (d ≤ c ≤ b) and a, c, d (d ≤ c ≤ a).]
Figure 4 Example of Hasse diagram.
Furthermore, although the inverse QSAR requirement is a very useful property, it must be noted that it
could be substituted by a surrogate approach based on an inductive analysis of optimal values of the studied
property obtained from new molecules generated by some automatic algorithm.
Finally, to summarize what is reported above, the development of a QSAR/QSPR model requires three
fundamental components: (1) a data set providing experimental measures of a biological activity or property
for a group of chemicals (i.e., the dependent variable of the model); (2) molecular descriptors, which encode
information about the molecular structures (i.e., the descriptors or the independent variables of the model); and
(3) mathematical methods to find the relationships between a molecule property/activity and the molecular
structure.
An example of QSAR modeling is given below.[86] The compounds under analysis are 30 monosubstituted
phenylacetanilides (Figure 5), whose substituents (X) are given in Table 1. The studied response is the
anticonvulsant activity log(1/ED50) of these chemicals.
The chemical structures were generated by dedicated software, and the geometry optimization was
performed by a semiempirical quantum method. On the basis of the generated chemical structures, five different
classes of molecular descriptors were calculated by a specific tool for molecular descriptor calculation:
constitutional, topological, geometrical, quantum-chemical, and interaction field descriptors. Removing all
the descriptors having a squared correlation coefficient with the activity less than 0.063 resulted in a set of 304
molecular descriptors, which were subsequently processed by a heuristic algorithm based on MLR and variable
selection.
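The prefiltering step can be sketched as follows: each descriptor is kept only if its squared correlation with the activity reaches the 0.063 threshold quoted above. The function name and the tiny data set are hypothetical, and treating constant descriptors as uncorrelated is an added assumption.

```python
import numpy as np

def filter_by_r2(X, y, threshold=0.063):
    """Return indices of descriptors whose squared correlation with y is >= threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (Xc * yc[:, None]).sum(axis=0) / denom
    r2 = np.nan_to_num(r ** 2)        # constant descriptors get r2 = 0
    keep = np.flatnonzero(r2 >= threshold)
    return keep, r2

# Hypothetical data: 6 compounds, 3 descriptors.
y = np.arange(1.0, 7.0)
X = np.column_stack([
    2.0 * y + 1.0,                           # perfectly correlated with y
    np.ones(6),                              # constant descriptor
    [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],        # uncorrelated with y
])

keep, r2 = filter_by_r2(X, y)
print(keep, r2)
```

Only the first descriptor survives here; the survivors would then be passed to the variable selection algorithm.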
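The following final two QSAR models were obtained (the signs of the individual terms are not legible in this
copy and are marked [±]):
$$\log\left(\frac{1}{\mathrm{ED}_{50}}\right) = 10.1446 \;[\pm]\; 4.0793\,\mathrm{xsc} \;[\pm]\; 6.1638\,\mathrm{avi} \;[\pm]\; 1.4332\,\mathrm{A92} \;[\pm]\; 0.0094\,\mathrm{mas} \;[\pm]\; 9.2412\,\mathrm{M2} \;[\pm]\; 0.1434\,\mathrm{mam} \tag{5}$$
$$n = 30 \qquad R^2 = 0.789 \qquad Q^2_{100} = 0.700 \qquad s = 0.14 \qquad F = 15.0$$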
Figure 5 Monosubstituted phenylacetanilides.
Table 1 Phenylacetanilide substituents (X), experimental anticonvulsant activity, and calculated values from Equations (5) and (6)
No.  X          Exp.   Equation (5)  Equation (6)
1    H          3.77   3.57          3.64
2    m-Me       3.75   3.65          3.71
3    m-Et       3.67   3.58          3.60
4    m-F        3.34   3.52          3.35
5    m-Cl       3.40   3.37          3.41
6    m-Br       3.32   3.23          3.32
7    m-I        2.64   2.64          2.64
8    m-CF3      2.84   2.83          2.85
9    m-OH       3.58   3.64          3.54
10   m-NH2      3.81   3.75          3.93
11   m-NHMe     4.03   3.82          4.00
12   m-NHEt     3.91   3.83          3.85
13   m-OMe      3.22   3.58          3.53
14   m-CN       3.44   3.46          3.51
15   m-NO2      3.62   3.68          3.63
16   m-COMe     3.95   3.65          4.00
17   m-OAc      3.48   3.58          3.49
18   m-OEt      3.42   3.56          3.46
19   m-OSO2Me   3.77   3.83          3.74
20   p-Me       3.26   3.67          3.42
21   p-F        3.49   3.49          3.52
22   p-OH       3.72   3.64          3.73
23   p-OMe      3.78   3.70          3.79
24   p-COMe     3.51   3.71          3.44
25   o-F        3.48   3.46          3.41
26   o-OH       3.33   3.42          3.34
27   o-NH2      3.40   3.46          3.40
28   o-OMe      3.43   3.41          3.44
29   o-NO2      3.29   3.20          3.24
30   o-COMe     3.41   3.34          3.44
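$$\log\left(\frac{1}{\mathrm{ED}_{50}}\right) = 0.9172 \;[\pm]\; 0.174\,\mathrm{mam} \;[\pm]\; 3.2783\,\mathrm{A58} \;[\pm]\; 1.3221\,\mathrm{xsc} \;[\pm]\; 5.1735\,\mathrm{F113} \;[\pm]\; 7.5902\,\mathrm{iox} \;[\pm]\; 1.7525\,\mathrm{P64} \;[\pm]\; 3.406\,\mathrm{M72} \;[\pm]\; 35.7998\,\mathrm{F109} \;[\pm]\; 1.987\,\mathrm{nsf} \;[\pm]\; 17.7329\,\mathrm{F96} \qquad (6)$$
$$n = 30 \qquad R^2 = 0.962 \qquad Q^2_{\mathrm{LOO}} = 0.901 \qquad s = 0.06 \qquad F = 51.2$$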
The model descriptors are xsc (maximum net charge for a C atom), avi (average free valence of I atoms), mas
(molecular mass), mam (molecular mass/number of atoms), iox (mass percent of I maximum net charge for an
I atom), nsf (minimum net charge for an F atom), A92 and A58 (sum of attractive electrostatic forces for grid
points 92 and 58, respectively), F96, F109, and F113 (sum of all electrostatic forces for grid points 96, 109, and
113, respectively), M2 and M72 (average parallax for grid points 2 and 72, respectively), and P64 (maximum
parallax for grid point 64).
Note that for the two models, both fitting ($R^2$) and prediction ($Q^2$) abilities were estimated.
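As an illustration of how the two statistics differ, the sketch below computes $R^2$ and a leave-one-out $Q^2$ for an ordinary least-squares model. It is a minimal example on synthetic data, assuming the common definition $Q^2 = 1 - \mathrm{PRESS}/\mathrm{TSS}$ with the total sum of squares taken about the mean of all responses; the function name `r2_and_q2_loo` is hypothetical, and the cited study may have used a different validation scheme.

```python
import numpy as np

def r2_and_q2_loo(X, y):
    """Fit y = b0 + X b by least squares and return (R2, Q2_LOO)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])      # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef                           # fitting residuals
    tss = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - (resid ** 2).sum() / tss
    # For least squares, the leave-one-out residual of object i equals
    # e_i / (1 - h_ii), where h_ii is the leverage of object i.
    leverage = np.diag(A @ np.linalg.pinv(A))
    press = ((resid / (1.0 - leverage)) ** 2).sum()
    q2 = 1.0 - press / tss
    return r2, q2

# Synthetic illustration: 30 objects, 3 descriptors.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 3))
y_demo = 3.5 + X_demo @ np.array([0.4, -0.2, 0.1]) + 0.1 * rng.normal(size=30)
r2, q2 = r2_and_q2_loo(X_demo, y_demo)
```

Because each leave-one-out residual is at least as large in magnitude as the corresponding fitting residual, $Q^2$ computed in this way never exceeds $R^2$; a large gap between the two suggests overfitting.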
4.05.4 Specific QSAR Approaches
There are many QSAR approaches in the literature, often reported under a large variety of names, which makes
it difficult to rationalize them into a well-defined classification system. An attempt to classify QSAR approaches
might be made by considering the objective to be reached by a QSAR approach, the type of molecular property
being modeled, the type of molecular descriptors the model is composed of, and the mathematical method or
computational algorithm used to estimate the model parameters. Therefore, focusing on the objective of the
analysis, it is possible to distinguish, for example, among drug design, high-throughput screening, and molecular
similarity analysis. Paying more attention to the property, terms such as ADME (absorption, distribution,
metabolism, and elimination properties) analysis, environmental QSAR, LSERs, and binary-QSAR are commonly
used. 2D-QSAR, 3D-QSAR, and 4D-QSAR refer to the type of molecular descriptors they are based on.
Terms such as group contribution method (GCM), structural analysis, and grid-based QSAR technique mainly
derive from the specific methods applied. In the following, some QSAR approaches are explained in more detail.
4.05.4.1 Hansch Approach
There is a consensus among current predictive toxicologists that Corwin Hansch is the founder of modern
QSAR. In the classic article [45], it was illustrated that, in general, biological activity for a group of congeneric
chemicals can be described by a comprehensive model:
$$\log\left(\frac{1}{C_{50}}\right) = a\,\pi + b\,\varepsilon + c\,S + d \qquad (7)$$
where $C$, the toxicant concentration at which an endpoint is manifested (e.g., 50% mortality or effect), is related
to a hydrophobicity term, $\pi$, an electronic term, $\varepsilon$ (originally the Hammett substituent constant, $\sigma$), and a steric
term, $S$ (typically Taft's substituent constant, $E_s$), $d$ being a general additional term depending on the kind of
property to be modeled.
In particular, the parameter $\pi$, which is the relative hydrophobicity of a substituent, was defined as
$$\pi = \log P_X - \log P_H \qquad (8)$$
where $P_X$ and $P_H$ represent the partition coefficients of a derivative and the parent molecule, respectively. This
is a substituent constant denoting the difference in hydrophobicity between a parent compound and a
substituted analog and is usually replaced by the more general molecular term, the logarithm of the 1-octanol/
water partition coefficient, $\log K_{ow}$ or $\log P$.
A practical example [86] of the Hansch approach is given here for the data shown in Table 1, referring to 30
phenylacetanilides (Figure 5) for which the anticonvulsant activity is known. The analysis was performed on
the following descriptors: $\log P$, the octanol/water partition coefficient; $\sigma$, the Hammett electronic constant of
the substituent; $I_p$, an indicator variable that takes the value 1 for p-derivatives and 0 for the other compounds;
$E_s$, the Taft steric constant for o-derivatives; and $R$, the electronic parameter for o-derivatives. The Hansch
QSAR models with four, five, and six independent variables are
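$$\log\left(\frac{1}{\mathrm{ED}_{50}}\right) = 2.280 \;[\pm]\; 0.264\,(\log P)^2 \;[\pm]\; 1.222\,(\log P) \;[\pm]\; 0.161\,\sigma \;[\pm]\; 0.079\,I_p \qquad (9)$$
$$n = 30 \qquad R^2 = 0.490 \qquad s = 0.228 \qquad F = 5.99$$
$$\log\left(\frac{1}{\mathrm{ED}_{50}}\right) = 2.311 \;[\pm]\; 0.290\,(\log P)^2 \;[\pm]\; 1.309\,(\log P) \;[\pm]\; 0.135\,\sigma \;[\pm]\; 0.057\,I_p \;[\pm]\; 0.404\,E_s \qquad (10)$$
$$n = 30 \qquad R^2 = 0.640 \qquad s = 0.195 \qquad F = 10.05$$
$$\log\left(\frac{1}{\mathrm{ED}_{50}}\right) = 2.478 \;[\pm]\; 0.276\,(\log P)^2 \;[\pm]\; 1.229\,(\log P) \;[\pm]\; 0.353\,\sigma \;[\pm]\; 0.223\,I_p \;[\pm]\; 0.278\,E_s \;[\pm]\; 0.621\,R \qquad (11)$$
$$n = 30 \qquad R^2 = 0.731 \qquad s = 0.172 \qquad F = 7.83$$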
Note that these models contain a nonlinear term, $(\log P)^2$, and that their predictive ability is not considered.
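A model that is quadratic in $\log P$ is still linear in its coefficients, so it can be estimated by ordinary least squares. The sketch below fits such a Hansch-type model to synthetic, noise-free data and recovers the coefficients used to generate them. All numerical values, the `sigma` values included, are hypothetical illustrations and not the phenylacetanilide data; `fit_hansch` is an assumed helper name.

```python
import numpy as np

def fit_hansch(logp, sigma, activity):
    """Least-squares fit of activity = k0 + k1*(logP)^2 + k2*logP + k3*sigma.
    Returns the coefficients in the order (k0, k1, k2, k3)."""
    A = np.column_stack([np.ones_like(logp), logp ** 2, logp, sigma])
    coef, *_ = np.linalg.lstsq(A, activity, rcond=None)
    return coef

# Hypothetical descriptor values for 12 congeneric compounds.
logp = np.linspace(0.5, 4.0, 12)
sigma = np.array([0.00, 0.23, -0.17, 0.71, 0.06, -0.66,
                  0.37, 0.12, -0.27, 0.78, 0.34, -0.07])

# Synthetic, noise-free activities generated from known coefficients.
true_coef = np.array([2.0, -0.25, 1.2, 0.3])
activity = (true_coef[0] + true_coef[1] * logp ** 2
            + true_coef[2] * logp + true_coef[3] * sigma)

coef = fit_hansch(logp, sigma, activity)

# With a negative (logP)^2 coefficient the parabola has a maximum,
# i.e. an optimal lipophilicity, at logP = -k2 / (2*k1).
logp_opt = -coef[2] / (2.0 * coef[1])
```

With these illustrative coefficients the optimum lies at $\log P = 2.4$. With real data the fit would be judged by $R^2$, $s$, and $F$, as reported for the equations above, and ideally also by a validation statistic.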
The contributions of Hammett and Taft together laid the basis for the development of the QSAR paradigm
by Hansch and Fujita, who combined the hydrophobic constants with Hammett's electronic constants to yield
the linear Hansch equation and its many extended forms.
4.05.4.2 Free-Wilson Approach
The Free-Wilson approach [47] is based on the assumption that a biological response can be modeled by additive
substituent contributions, that is, the substituent effects are considered independent of each other; the
congenericity of the compounds is a further basic requirement.
Once a common skeleton for the chemical analogs is defined, regression analysis is performed, considering a
number $S$ of substitution sites $R_s$ ($s = 1, \ldots, S$), and for each site a number $N_s$ of different substituents. Hydrogen
atoms are also considered as substituents if present in a substitution site of some compounds. The Free-Wilson
descriptors of the $i$th compound are indicator variables $I_{i,ks}$, where $I_{i,ks} = 1$ if the $k$th substituent is present in the
$s$th site and $I_{i,ks} = 0$ otherwise.
The Free-Wilson model is defined as
$$y_i = b_0 + \sum_{s=1}^{S} \sum_{k=1}^{N_s} b_{ks}\, I_{i,ks} \qquad (12)$$
where $b_0$ is the intercept of the model, corresponding to the average biological response calculated from the data
set, and $b_{ks}$ are the regression coefficients. The biological response $y$ is usually used in the form $\log(1/C)$, where $C$
is the concentration achieving a fixed effect. The regression coefficients $b_{ks}$ of the Free-Wilson model give the
importance of each $k$th substituent in each $s$th site in increasing/decreasing the response with respect to the
mean response, that is, the activity contribution of the substituent.
A simple example of the Free-Wilson approach is given below, considering eight derivatives of toluene with two
substitution sites (X and Y) (Figure 6).
In site X, ethyl, fluorine, chlorine, and bromine substituents are allowed ($N_X = 4$), whereas in site Y,
only chlorine and bromine substituents are allowed ($N_Y = 2$). The eight possible derivatives are coded in the
Free-Wilson approach as shown in Table 2.
Figure 6 Toluene parent molecule.
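The indicator coding of the eight toluene derivatives can be generated mechanically. The following sketch enumerates all combinations of the allowed substituents and builds one row of indicator variables per compound; the compound ordering (site X varying slowest) and the helper name `free_wilson_row` are assumptions made for this illustration.

```python
from itertools import product

SITE_X = ["C2H5", "F", "Cl", "Br"]   # substituents allowed in site X (N_X = 4)
SITE_Y = ["Cl", "Br"]                # substituents allowed in site Y (N_Y = 2)

def free_wilson_row(x_sub, y_sub):
    """Indicator variables I_{i,ks}: 1 if the substituent occupies the site, else 0."""
    return ([int(x_sub == s) for s in SITE_X]
            + [int(y_sub == s) for s in SITE_Y])

# Enumerate the eight possible derivatives and code each one.
compounds = list(product(SITE_X, SITE_Y))
matrix = [free_wilson_row(x, y) for x, y in compounds]

for number, row in enumerate(matrix, start=1):
    print(number, row)
```

Each row contains exactly one 1 per site, so the indicator columns of a site always sum to 1. When an intercept is included, the matrix is therefore rank deficient, and a regression on it usually needs either one reference substituent per site to be dropped or a constraint on the coefficients.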
4.05.4.3 LSER Approach
LSERs constitute the basis on which the effects of solvent-solute interactions on physicochemical properties and
reactivity parameters are studied. In general, a property $P$ of a species A in a solvent S can be expressed as
$$P_{A,S} = \sum_j \phi_j(A, S) \qquad (13)$$
where $\phi_j$ are complex functions of both solvents and solutes [87]. By assuming that these functions can
be factorized into two contributions separately dependent on solute and solvent, the property can be
represented as
$$P_{A,S} = \sum_j f_j(A)\, g_j(S) \qquad (14)$$
where $f_j$ are functions of the solute and $g_j$ functions of the solvent.
The underlying philosophy of the LSER is based on the possibility of studying these two functions, after a
proper choice of the reference systems and properties. Moreover, it has been recognized that solution properties
$P$ mainly depend on three factors: a cavity term, a polar term, and a hydrogen-bond term:
P = intercept + cavity term + dipolarity/polarizability term + hydrogen-bond term
Therefore, a typical LSER is expressed as [88]
$$P_{A,S} = b_0 + b_1\,(\delta_H^2)_1\, V_2 + b_2\, \pi^*_1\, \pi^*_2 + b_3\, \alpha_1\, \beta_2 + b_4\, \beta_1\, \alpha_2 \qquad (15)$$
where $b$ are the estimated regression coefficients, and the subscripts 1 and 2 in the solvent/solute property
parameters refer to the solvent S and the solute A, respectively. This equation is usually known as the
solvatochromic equation and the parameters of dipolarity/polarizability and hydrogen bonding as solvatochromic
parameters. The term solvatochromic derives from the origin of this approach, namely the effect that a solvent
has on the color of an indicator, which is used for the quantitative determination of some molecular attributes
(solvatochromic parameters).
From the general solvatochromic equation, two special cases can be derived. When dealing with the effects
of different solvents on the properties of a specific solute, the general equation is written explicitly in terms of the solvent parameters:
$$P_{A,S_i} = b_0 + b_1\,(\delta_H^2)_{1,i} + b_2\, \pi^*_{1,i} + b_3\, \alpha_{1,i} + b_4\, \beta_{1,i} \qquad (16)$$
This equation has been used in several correlations of solvent effects on solute properties such as reaction rates
and equilibrium constants of solvolyses, energy of electronic transitions, solvent-induced shifts in ultraviolet/
visible, infrared, and nuclear magnetic resonance spectroscopy, fluorescence lifetimes, and formation constants of
hydrogen-bonded and Lewis acid/base complexes [89].
Table 2 Free-Wilson matrix of eight toluene derivatives
           Site X                 Site Y
Compound   C2H5   F    Cl   Br    Cl   Br
1          1      0    0    0     1    0
2          1      0    0    0     0    1
3          0      1    0    0     1    0
4          0      1    0    0     0    1
5          0      0    1    0     1    0
6          0      0    1    0     0    1
7          0      0    0    1     1    0
8          0      0    0    1     0    1
Conversely, when dealing with solubilities, lipophilicity, or other properties of a set of different solutes in a
specific solvent, the general equation is written explicitly in terms of the solute parameters:
$$P_{A_i,S} = b_0 + b_1\, V_i + b_2\, \pi^*_{2,i} + b_3\, \alpha_{2,i} + b_4\, \beta_{2,i} \qquad (17)$$
This equation has been mainly used in correlations of aqueous solubility of compounds, octanol/water partition
coefficients, and some other partition parameters, together with some biological properties [89–91].
Recently, the terms of the LSER equations were redefined by Abraham et al. [92] as
$$P = c + e\,E + s\,S + a\,A + b\,B + l\,L \qquad (18)$$
where $E$ is the solute excess molar refractivity, $S$ is the solute dipolarity/polarizability, $A$ and $B$ are the overall
(summation) hydrogen bond acidity and basicity, respectively, and $L$ is the logarithm of the gas-hexadecane
partition coefficient. The terms $c$, $e$, $s$, $a$, $b$, and $l$ are the regression coefficients to be estimated.
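Once the regression coefficients have been estimated, applying an equation of the form (18) is a simple weighted sum of the solute descriptors. The sketch below shows this evaluation; the class and function names are hypothetical, and the coefficient and descriptor values are arbitrary numbers chosen only to make the arithmetic easy to follow, not published values for any real system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LserCoefficients:
    """Regression coefficients c, e, s, a, b, l of an equation of the form (18)."""
    c: float
    e: float
    s: float
    a: float
    b: float
    l: float

def lser_predict(coef, E, S, A, B, L):
    """Evaluate P = c + e*E + s*S + a*A + b*B + l*L for one solute."""
    return (coef.c + coef.e * E + coef.s * S
            + coef.a * A + coef.b * B + coef.l * L)

# Arbitrary illustrative numbers (not fitted to any data set).
coef = LserCoefficients(c=-0.5, e=0.1, s=1.0, a=2.0, b=0.5, l=0.8)
predicted = lser_predict(coef, E=0.6, S=0.5, A=0.0, B=0.1, L=3.0)
# -0.5 + 0.06 + 0.5 + 0.0 + 0.05 + 2.4 = 2.51
```

Each product in the sum can be inspected separately, which is one reason such equations are used to discuss which solute-solvent interaction dominates a given property.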
4.05.4.4 Group Contribution Methods
GCMs search for relationships between structural properties and a physicochemical or biological response
based on the following general model:
$$P = f(G_1, G_2, \ldots, G_m;\; n_1, n_2, \ldots, n_m) \qquad (19)$$
where the experimental property $P$ for the compound is a function of $m$ group contributions $G_j$ and their
occurrences $n_j$ [93]. The group contributions, also known as fragmental constants, are numerical quantities
associated with substructures of the molecule, such as single atoms, atom pairs, atom-centered substructures,
molecular fragments, and functional groups. For example, atom contribution models exhibit a one-to-one
correspondence between atoms and property contributions, that is, the molecular property is a function of all
the single atomic properties. The specification of the structural groups depends on the particular GCM scheme
adopted.
Generally, the application of GCM to a molecule requires the following steps:
1. Identification of all groups in the molecule applicable to the particular GCM scheme.
2. Calculation of fragmental constants measuring contributions to the molecular property of the considered
groups by employing the function associated with the particular GCM.
3. Evaluation of some correction factors that should account for interactions among molecular groups.
The group contributions are usually estimated by multivariate regression analysis on chemicals of known
properties, but they can also be experimental, theoretical, or user-defined quantities. When estimation of group
contributions is carried out by regression analysis, large training sets of chemicals are required to obtain reliable
estimates. Usually, a battery of group contributions (a set of scalar parameters) is defined taking into account
several structural characteristics of the molecules. If correction factors are accounted for, the GCM models are
usually called additive-constitutive models.
Linear GCM models are defined as follows:
$$y_i = k_0 + \sum_{j=1}^{m} G_j\, I_{ij} \quad \text{or} \quad y_i = k_0 + \sum_{j=1}^{m} G_j\, n_{ij} \qquad (20)$$
where $k_0$ is a model-specified constant, $j$ runs over the $m$ groups defined within the GCM scheme, and $G_j$ is the
contribution of the $j$th group. $I_{ij}$ and $n_{ij}$ are substructure descriptors: $I_{ij}$ is a binary variable taking a
value equal to 1 if the $j$th group is present in the $i$th molecule and 0 otherwise, and $n_{ij}$ is the number of occurrences
of the $j$th group in the $i$th molecule.
Nonlinear GCM models are usually defined as
$$y_i = k_0 + \sum_{j=1}^{m} G_j\, n_{ij} + \left( \sum_{j=1}^{m} G_j\, n_{ij} \right)^2 \qquad (21)$$
Moreover, mixed GCM models are defined by adding, usually, one or more descriptors of the whole molecular
structure to the group descriptors:
$$y_i = k_0 + \sum_{j=1}^{m} G_j\, n_{ij} + \sum_{j'=1}^{p} \Phi_{ij'} \qquad (22)$$
where the second summation runs over the $p$ molecular descriptors defined in the GCM scheme and $\Phi_{ij'}$ is the
$j'$th descriptor value for the $i$th molecule.
The group contribution approach was extensively applied for the estimation of the octanol/water partition
coefficient, which is a powerful lipophilicity descriptor. Examples are the Nys-Rekker method [94],
Broto-Moreau-Vandycke log P [95], Ghose-Crippen log P (ALOGP) [96], Moriguchi log P (MLOGP) [97], and Klopman
log P (KLOGP) [98].
Furthermore, group contribution models were proposed for several molecular property estimations, such as
boiling and melting points [99,100], molar refractivity [101], pKa [102], critical temperatures, solubilities [103],
soil sorption coefficients [104], and several thermodynamic properties [105,106]. Another well-known group
contribution model is that proposed by Atkinson for the evaluation of reaction rate constants with hydroxyl
radicals of organic compounds [107].
An example of a GCM for the calculation of the topological polar surface area (TPSA) of molecules is given
below. TPSA is calculated according to the model proposed by Ertl et al. [108], whose group contributions are
listed in Table 3.
Table 3 List of surface group contributions of polar atom types
No.  Atom type              PSA contribution (G)
1    [N](-*)(-*)-*          3.24
2    [N](-*)=*              12.36
3    [N]#*                  23.79
4    [N](-*)(=*)=* (b)      11.68
5    [N](=*)#* (c)          13.60
6    [N]1(-*)-*-*1 (d)      3.01
7    [NH](-*)-*             12.03
8    [NH]1-*-*1 (d)         21.94
9    [NH]=*                 23.85
10   [NH2]-*                26.02
11   [N+](-*)(-*)(-*)-*     0.00
12   [N+](-*)(-*)=*         3.01
13   [N+](-*)#* (e)         4.36
14   [NH+](-*)(-*)-*        4.44
15   [NH+](-*)=*            13.97
16   [NH2+](-*)-*           16.61
17   [NH2+]=*               25.59
18   [NH3+]-*               27.64
19   [n](:*):*              12.89
20   [n](:*)(:*):*          4.41
21   [n](-*)(:*):*          4.93
22   [n](=*)(:*):* (f)      8.39
23   [nH](:*):*             15.79
24   [n+](:*)(:*):*         4.10
25   [n+](-*)(:*):*         3.88
26   [nH+](:*):*            14.14
27   [O](-*)-*              9.23
28   [O]1-*-*1 (d)          12.53
29   [O]=*                  17.07
30   [OH]-*                 20.23
31   [O-]-*                 23.06
32   [o](:*):*              13.14
33   [S](-*)-*              25.30
34   [S]=*                  32.09
35   [S](-*)(-*)=*          19.21
36   [S](-*)(-*)(=*)=*      8.38
37   [SH]-*                 38.80
38   [s](:*):*              28.24
39   [s](=*)(:*):*          21.70
40   [P](-*)(-*)-*          13.59
41   [P](-*)=*              34.14
42   [P](-*)(-*)(-*)=*      9.81
43   [PH](-*)(-*)=*         23.47
An asterisk (*) stands for any non-hydrogen atom, - for a single bond, = for a double bond, # for a triple bond, and : for an
aromatic bond; an atomic symbol in lowercase means that the atom is part of an aromatic system. (b) As in nitro group. (c) Middle
nitrogen in azide group. (d) Atom in a three-membered ring. (e) Nitrogen in isocyano group. (f) As in pyridine N-oxide.
The TPSA of a molecule is determined by the summation of tabulated surface contributions of polar atom
types as
$$\mathrm{TPSA}_i = \sum_{j=1}^{m} G_j\, n_{ij} \qquad (23)$$
where the sum runs over the defined types of polar fragments (see Table 3), $n_{ij}$ is the frequency of the $j$th polar
fragment type in the $i$th molecule, and $G_j$ is the surface contribution of the $j$th fragment type. The surface
contributions were calculated by least-squares fitting of the TPSA-based fragments to the single-conformer 3D
PSA of a training set consisting of 34 810 drug-like molecules taken from the World Drug Index database. The
statistical parameters of the model are $R^2 = 0.982$ and $s = 7.83$.
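The summation in Equation (23) can be illustrated with a few contributions taken from Table 3. In the sketch below, the fragment counts of each molecule are assigned by hand, which is an assumption of this example; a real implementation would identify the atom types by substructure matching. The atom-type strings use - for a single bond and * for any non-hydrogen atom, as in the table footnote.

```python
# Subset of the polar atom-type contributions G_j listed in Table 3.
PSA_CONTRIBUTION = {
    "[O]=*": 17.07,       # doubly bonded oxygen (e.g., carbonyl)
    "[OH]-*": 20.23,      # hydroxyl oxygen
    "[O](-*)-*": 9.23,    # oxygen with two single bonds (ether, ester)
    "[NH2]-*": 26.02,     # primary amine nitrogen
}

def tpsa(fragment_counts):
    """TPSA_i = sum_j G_j * n_ij over the polar atom types of one molecule."""
    return sum(PSA_CONTRIBUTION[atom_type] * n
               for atom_type, n in fragment_counts.items())

# Hand-assigned counts n_ij (assumption: atom types identified manually).
acetic_acid = {"[O]=*": 1, "[OH]-*": 1}                  # CH3-COOH
aspirin = {"[O]=*": 2, "[OH]-*": 1, "[O](-*)-*": 1}      # acid + ester groups

print(round(tpsa(acetic_acid), 2))   # 17.07 + 20.23
print(round(tpsa(aspirin), 2))       # 2*17.07 + 20.23 + 9.23
```

Since no 3D structure is needed, a calculation of this kind is fast enough to be applied to very large collections of molecules.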
4.05.4.5 Cluster Significance Analysis
CSA is at the same time a QSAR and a variable selection method, having been proposed for determining which
molecular descriptors of a set of compounds are associated with a biological response. The active compounds
are expected to be similar to each other in the chemical space defined by the relevant descriptors and so will
cluster.
This approach, originally proposed for binary response variables [109], was extended to quantitative
biological responses, scaled between 0 and 1, under the name of generalized cluster significance analysis
(GCSA) [110].
Let $\mathbf{X}$ be a data matrix of $n$ rows (i.e., the compounds) and $p$ columns (i.e., the descriptors) and $\mathbf{y}$ the vector of
the $n$ biological responses. The mean square distance $\mathrm{MSD}_j$ was proposed to measure the tightness of the cluster
of active compounds with respect to each $j$th molecular descriptor:
$$\mathrm{MSD}_j = \frac{\sum_{s=1}^{n-1} \sum_{t=s+1}^{n} y_s\, y_t\, (x_{sj} - x_{tj})^2}{n(n-1)} \qquad (24)$$
where $n$ is the number of compounds, $y_s$ and $y_t$ are the biological responses of compounds $s$ and $t$, and $x_{sj}$ and $x_{tj}$
are the $j$th descriptor values of the two compounds. A small MSD value indicates that the considered descriptor
has a good capability to cluster compounds with the same biological activity.
The MSD calculated as above is proportional to that calculated as
$$\mathrm{MSD}_j = \sum_{i=1}^{n} y_i \left( x_{ij} - \bar{x}_j^{W} \right)^2 \qquad (25)$$
where the weighted mean is calculated as
$$\bar{x}_j^{W} = \frac{\sum_{i=1}^{n} y_i\, x_{ij}}{\sum_{i=1}^{n} y_i} \qquad (26)$$
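The proportionality between the pairwise form (24) and the weighted-mean form (25) can be checked numerically. Expanding the squares gives $\sum_{s<t} y_s y_t (x_{sj} - x_{tj})^2 = \left(\sum_i y_i\right) \sum_i y_i (x_{ij} - \bar{x}_j^W)^2$, so the two expressions differ by the factor $\sum_i y_i / [n(n-1)]$, which does not change when the responses are permuted. The sketch below verifies this on a small set of made-up values; the function names are hypothetical.

```python
import numpy as np

def msd_pairwise(x, y):
    """Equation (24): response-weighted squared distances over all pairs s < t."""
    n = len(x)
    total = 0.0
    for s in range(n - 1):
        for t in range(s + 1, n):
            total += y[s] * y[t] * (x[s] - x[t]) ** 2
    return total / (n * (n - 1))

def msd_weighted(x, y):
    """Equations (25)-(26): squared deviations from the response-weighted mean."""
    x_w = np.dot(y, x) / np.sum(y)
    return float(np.sum(y * (x - x_w) ** 2))

# Made-up descriptor values and responses scaled between 0 and 1.
x = np.array([0.10, 0.20, 0.15, 0.90, 0.80, 0.40])
y = np.array([1.0, 0.8, 0.9, 0.0, 0.1, 0.5])

n = len(x)
factor = y.sum() / (n * (n - 1))       # proportionality constant
lhs = msd_pairwise(x, y)
rhs = factor * msd_weighted(x, y)
```

In practice the weighted form is cheaper to compute, since it needs a single pass over the compounds instead of a loop over all pairs.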
To reach a statistical evaluation of the clustering capability of each descriptor, a test for significance is
performed using a random permutation of the responses and using the permuted values to recalculate MSD
values; this calculation is repeated $N$ times (e.g., $N$ = 100 000). Then, for any given descriptor, the number $c_j$ of
times giving a value less than or equal to $\mathrm{MSD}_j$ is used to obtain the significance level (p-value) and the
standard error $s$ of this estimate:
$$p_j = \frac{c_j}{N} \qquad (27)$$
$$s_j = \sqrt{\frac{p_j\,(1 - p_j)}{N}} \qquad (28)$$
The best descriptor is chosen based on the minimum p-value.
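The permutation test for a single descriptor can be sketched as follows. This is a minimal illustration, assuming the weighted-mean form of the MSD (which ranks permutations in the same way as the pairwise form) and a much smaller number of permutations than the 100 000 mentioned above; the function name `csa_p_value` and the data are hypothetical.

```python
import numpy as np

def csa_p_value(x, y, n_perm=2000, seed=0):
    """Return (p, s): the fraction of random permutations of y whose MSD is
    less than or equal to the observed MSD, and its standard error."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def msd(responses):
        # Assumes at least one nonzero response, so the weighted mean exists.
        x_w = np.dot(responses, x) / responses.sum()
        return np.sum(responses * (x - x_w) ** 2)

    observed = msd(y)
    tolerance = 1e-12                   # guards against floating-point ties
    count = 0
    for _ in range(n_perm):
        if msd(rng.permutation(y)) <= observed + tolerance:
            count += 1
    p = count / n_perm
    s = np.sqrt(p * (1.0 - p) / n_perm)
    return p, s

# Made-up example: 10 actives tightly clustered near x = 1, 20 inactives spread out.
x_demo = np.concatenate([1.0 + 0.01 * np.arange(10), np.linspace(-5.0, 5.0, 20)])
y_demo = np.concatenate([np.ones(10), np.zeros(20)])
p_demo, s_demo = csa_p_value(x_demo, y_demo)
```

For a descriptor that clusters the active compounds as tightly as in this example, almost no random permutation reaches an MSD as small as the observed one, so the estimated p-value is close to zero.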
If some descriptors are being considered together, the corresponding MSD random values are added
together, as are the corresponding actual MSD values, before the count is taken.
Therefore, the selection of the best subset model can be performed by forward stepwise selection starting
from the variable with the lowest p-value (the current model); next, each of the variables that are not yet
included in the current model is added to it in turn, producing a set of candidates with corresponding p-values.
The candidate model with the lowest p-value is selected, and the process is repeated on the new current model.
4.05.4.6 Read-Across Approach
Recently adopted to fill data gaps for chemicals, read-across [111] is a nonformalized approach in which
endpoint information for one chemical (called a source chemical) is used to make a prediction of the endpoint
for another chemical (called a target chemical), which is considered to be similar in some way (usually on the
basis of structural similarity). In principle, read-across can be applied to characterize physicochemical properties,
fate, human health effects, and ecotoxicity. It can be either qualitative or quantitative, depending on
whether the data being used are categorical or numerical in nature.
To estimate the properties of a given substance, read-across can be performed in a one-to-one manner (one
analog used to make an estimation) or in a many-to-one manner (two or more analogs used). Within the context of a
chemical category, read-across can also be performed in a one-to-many manner or in a many-to-many manner.
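Because read-across is not formalized, no single algorithm defines it. Purely as an illustration of a quantitative many-to-one estimate, the sketch below predicts the endpoint of a target chemical as the similarity-weighted average of its k most similar source chemicals. The use of the Tanimoto coefficient on sets of structural features, the weighting scheme, and all names and values are assumptions of this example, not a prescribed procedure.

```python
def tanimoto(features_a, features_b):
    """Tanimoto similarity between two sets of structural features."""
    a, b = set(features_a), set(features_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def read_across(target_features, sources, k=2):
    """Similarity-weighted mean endpoint of the k most similar source chemicals.
    `sources` is a list of (features, endpoint_value) pairs."""
    scored = [(tanimoto(target_features, features), value)
              for features, value in sources]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    nearest = scored[:k]
    total_weight = sum(weight for weight, _ in nearest)
    if total_weight == 0:
        raise ValueError("no source chemical shares any feature with the target")
    return sum(weight * value for weight, value in nearest) / total_weight

# Hypothetical feature sets (integers stand for structural fragments) and endpoints.
target = {1, 2, 3, 4}
sources = [({1, 2, 3, 4}, 2.0),    # similarity 1.00
           ({1, 2, 3}, 4.0),       # similarity 0.75
           ({9}, 100.0)]           # similarity 0.00, never among the nearest here

one_to_one = read_across(target, sources, k=1)    # single analog
many_to_one = read_across(target, sources, k=2)   # two analogs
```

With k = 1 the estimate is simply the endpoint of the closest analog, whereas with k = 2 it becomes (1.00 × 2.0 + 0.75 × 4.0)/1.75.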
4.05.5 Molecular Descriptors
In the last decades, much scientific research has focused on how to capture the information encoded in the
molecular structure and convert it, by a theoretical pathway, into one or more numbers used to
establish quantitative relationships between structures and properties, biological activities, or other experimental
properties. Molecular descriptors are formally mathematical representations of a molecule obtained by a
well-specified algorithm applied to a defined molecular representation or by a well-specified experimental
procedure: "The molecular descriptor is the final result of a logic and mathematical procedure which transforms
chemical information encoded within a symbolic representation of a molecule into a useful number or the
result of some standardized experiment" [112].
Molecular descriptors play a fundamental role in chemistry, pharmaceutical sciences, environmental protection
policy, toxicology, ecotoxicology, health research, and quality control. Evidence of the interest of the
scientific community in molecular descriptors is provided by the huge number of descriptors proposed to
date: more than 3000 descriptors [112] derived from different theories and approaches are currently defined and
computable by using dedicated software tools.
Each molecular descriptor takes into account a small part of the whole chemical information contained in the
real molecule, and, as a consequence, the number of descriptors is continuously increasing with the increasing
request of deeper investigations on chemical and biological systems.
Different descriptors are different ways or perspectives to view a molecule, taking into account the various features
of its chemical structure. By now, molecular descriptors have become some of the most important variables used in
molecular modeling and, consequently, are managed by statistics, chemometrics, and chemoinformatics.
The availability of molecular descriptors has been not only a new opportunity to search for new
relationships but also a great change in the research paradigm of this field: in effect, the use of
molecular descriptors calculated from theory has permitted, for the first time, experimental knowledge to be linked
to theoretical information arising from the molecular structure. While until the 1960s–70s molecular modeling
mainly consisted in searching for mathematical relationships between experimentally measured quantities,
nowadays it is mainly performed by searching for relationships between a measured property and molecular
descriptors able to capture structural chemical information (Figure 2).
A general consideration about the use of molecular descriptors in modeling problems concerns their
information content. This depends on the kind of molecular representation used and the algorithm defined
for its calculation. There are simple molecular descriptors derived by counting some atom types or structural
fragments in the molecule, as well as physicochemical and bulk properties such as molecular weight, number of
hydrogen bond donors/acceptors, and number of OH-groups.
Other molecular descriptors are derived from algorithms applied to a topological representation and are usually
called topological or 2D descriptors. Others are derived from the spatial (x, y, z) coordinates
of the molecule and are usually called geometrical or 3D descriptors; another class of molecular descriptors, called 4D
descriptors, is derived from the interaction energies between the molecule, embedded in a grid, and some probe.
In Figure 7, a (very) simplified scheme of the major classes of molecular descriptors is shown.
It is true that geometrical 3D/4D descriptors have a higher information content than simpler
descriptors, such as counting descriptors or topological descriptors, which often show relevant levels of
degeneracy. Consequently, it is often thought that it is better to use the most informative descriptors in all modeling
processes. This view is incorrect because the best descriptors are those whose information content is
comparable with the information content of the response for which the model is sought. In effect, an excess of
information in the independent variables (the descriptors) with respect to the response is often seen as noise
by the model, thus giving unstable or nonpredictive models. For example, a property whose values are
equal or similar for isomeric structures is better modeled by a simple descriptor with degenerate values for
isomeric structures. In this case, descriptors able to discriminate among the isomeric structures carry
redundant information that cannot be integrated into the model. In conclusion, it can be stated that a best
descriptor (or set of descriptors) valid for all problems does not exist.
In general, molecular descriptors, besides the trivial invariance properties, should satisfy some basic
requirements. A list of desirable requirements of chemical descriptors suggested by Randić [113] is shown in
Table 4.
Many software packages calculate wide sets of different theoretical descriptors, starting from SMILES, 2D graphs, or 3D
x, y, z spatial coordinates. Some of the most popular are mentioned here: ADAPT [114], OASIS [115],
CODESSA [116], MolConn-Z [117], and DRAGON [118]. A website wholly dedicated to molecular descriptors was
created in 2007 by the Milano Chemometrics and QSAR Research Group (http://www.moleculardescriptors.eu),
where, together with information about software and books, news and tutorials concerning
molecular descriptors are provided.
Figure 7 General scheme of the different sources of molecular descriptors. [Scheme labels: atom list (0D; counting, summing); substructure list (1D; counting, structural keys); molecular graph (2D; graph invariants: topostructural, topochemical, and topographic descriptors, topological information indices); molecular geometry, x, y, z coordinates (3D; geometrical, quantum-chemical, steric/bulk, and molecular surface descriptors); grid-based QSAR techniques (4D; interaction energy values).]
4.05.5.1 Molecular Structure Representations
The molecular representation is the way in which a molecule, that is, a phenomenological real body, is
symbolically represented by a specific formal procedure and conventional rules. The quantity of chemical
information that is transferred to the symbolic representation of the molecule depends on the kind of
representation [119,120].
The simplest molecular representation is the chemical formula, which is the list of the different atom types,
each accompanied by a subscript representing the number of occurrences of the atoms in the molecule. For
example, the chemical formula of p-chlorotoluene is $\mathrm{C_7H_7Cl}$, indicating the presence in the molecule of 15
atoms distinguished into $N_C = 7$, $N_H = 7$, and $N_{Cl} = 1$. This representation is independent of any knowledge
concerning the molecular structure, and hence, molecular descriptors obtained from the chemical formula can
be called 0D descriptors. Examples are the atom number, molecular weight, atom-type count, and, in general,
constitutional descriptors and any function of the atomic properties.
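A few 0D descriptors of p-chlorotoluene can be derived from its chemical formula alone, as in the sketch below. The parser is a deliberately simple assumption: it handles only flat formulas such as C7H7Cl (no parentheses, charges, or isotopes), and the atomic masses are rounded standard values for the three elements involved.

```python
import re

# Rounded standard atomic masses for the elements used in this example.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "Cl": 35.45}

def parse_formula(formula):
    """Return atom-type counts for a flat formula such as 'C7H7Cl'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts

counts = parse_formula("C7H7Cl")                 # atom-type counts
n_atoms = sum(counts.values())                   # atom number
mol_weight = sum(ATOMIC_MASS[el] * n for el, n in counts.items())

print(counts, n_atoms, round(mol_weight, 3))
```

All three quantities are identical for o-, m-, and p-chlorotoluene, which illustrates why descriptors derived from the chemical formula cannot discriminate among isomers.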
The atomic properties constitute the weights used to characterize molecule atoms; the most common atomic
properties are atomic mass, atomic charge, covalent and van der Waals radii, atomic polarizability, and
hydrophobic atomic constants.
The substructure list representation can be considered as a 1D representation of a molecule and consists of a
list of structural fragments of a molecule; the list can be only a partial list of fragments, functional groups, or
substituents of interest, thus not requiring a complete knowledge of the molecule structure. The descriptors
derived by this representation can be referred to as 1D descriptors and are typically used in substructural
analysis and substructure searching with a common name of molecular fingerprints.
The 2D representation of a molecule considers how the atoms are connected, that is, it defines the connectivity
of atoms in the molecule in terms of the presence and nature of chemical bonds. Approaches based on the
molecular graph allow a 2D representation of a molecule, usually known as topological representation.
A molecular graph is usually denoted as $G = (V, E)$, where $V$ is a set of vertices which correspond to the
molecule atoms and E is a set of elements representing the binary relationship between pairs of vertices;
unordered vertex pairs are called edges, which correspond to bonds between atoms.
A molecular graph obtained excluding all the hydrogen atoms is called H-depleted molecular graph, whereas
a molecular graph where hydrogen atoms are also included is called H-filled molecular graph (or, simply,
molecular graph). In Figure 8, examples of H-depleted molecular graphs are given for 2-methyl-3-butenoic
acid, 1-ethyl-2-methylcyclobutane, and 5-methyl-1,3,4-oxathiazol-2-one.
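An H-depleted molecular graph can be stored simply as a list of edges. The sketch below encodes 2-methyl-3-butenoic acid in this way and derives some simple graph invariants: the numbers of vertices and edges, the vertex degrees, and the sum of the topological distances over all vertex pairs (known as the Wiener index). The vertex numbering is an assumption of this example and need not coincide with that of Figure 8; bond orders and atom types are ignored.

```python
from collections import deque

# H-depleted graph of 2-methyl-3-butenoic acid, CH2=CH-CH(CH3)-COOH.
# Assumed numbering: 1 = carboxyl C, 2 = C bearing the methyl group,
# 3 = =CH-, 4 = =CH2, 5 = CH3, 6 = carbonyl O, 7 = hydroxyl O.
EDGES = [(1, 2), (2, 3), (3, 4), (2, 5), (1, 6), (1, 7)]

def adjacency(edges):
    """Build the adjacency sets of an undirected graph from its edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def distances_from(adj, start):
    """Topological distances (number of bonds) from `start`, by breadth-first search."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def wiener_index(adj):
    """Sum of topological distances over all unordered vertex pairs."""
    total = sum(d for v in adj for d in distances_from(adj, v).values())
    return total // 2                   # each pair was counted twice

adj = adjacency(EDGES)
degrees = {v: len(neighbors) for v, neighbors in adj.items()}
wiener = wiener_index(adj)
```

For this graph, with 7 vertices and 6 edges, the sum of distances is 46. Such quantities depend only on the connectivity, so they are the same for any drawing or conformation of the molecule.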
Table 4 List of desirable requirements for molecular descriptors
No.  Requirement
1    Should have structural interpretation
2    Should have good correlation with at least one property
3    Should preferably discriminate among isomers
4    Should be possible to apply to local structure
5    Should be possible to generalize to higher descriptors
6    Descriptors should be preferably independent
7    Should be simple
8    Should not be based on properties
9    Should not be trivially related to other descriptors
10   Should be possible to construct efficiently
11   Should use familiar structural concepts
12   Should have the correct size dependence
13   Should change gradually with gradual change in structures
The molecular graph depicts the connectivity of atoms in a molecule irrespective of the metric parameters
such as equilibrium interatomic distances between nuclei, bond angles, and torsion angles. Thus, a molecular
graph is a topological representation of the molecule, and it is from this that a lot of molecular descriptors are
derived. These are 2D descriptors and usually are graph invariants known with the name of TIs.
Two-dimensional representations alternative to the molecular graph are the linear notation systems, such as
the Wiswesser Line Notation (WLN) system [121] and the SMILES notation [122].
The 3D representation views a molecule as a rigid geometrical object in space and allows a representation
not only of the nature and connectivity of the atoms but also of the overall spatial configuration of the molecule.
This representation of a molecule is called geometrical representation and defines a molecule in terms of atom
types constituting the molecule and the set of (x, y, z) coordinates associated to each atom. Figure 9 shows a
geometrical representation of lactic acid. Molecular descriptors derived from this representation are called 3D
descriptors, and examples are the geometrical descriptors, several steric descriptors, and size descriptors.
Several molecular descriptors derive from multiple molecular representations and can then be classified with
difficulty. For example, graph invariants derived from a molecular graph weighted by properties obtained by
computational chemistry are both 2D and 3D descriptors.
The bulk representation of a molecule describes the molecule in terms of a physical object with 3D attributes
such as bulk and steric properties, surface area, and volume.
The stereoelectronic representation (or lattice representation) of a molecule is a molecular description
related to those molecular properties arising from electron distribution, interaction of the molecule with probes
characterizing the space surrounding them (e.g., MIFs). This representation is typical of the GRID-based
QSAR techniques. Descriptors at this level can be considered 4D descriptors, being characterized by a scalar
field, that is, a lattice of scalar numbers, derived from the 3D molecular geometry (Figure 10).
Finally, the stereodynamic representation of a molecule is a time-dependent representation that adds
structural properties to the 3D representations, such as flexibility, conformational behavior, and transport
properties. Dynamic QSAR is an example of a multiconformational approach [123,124].
Figure 8 Some molecular graph representations of molecules.
Figure 9 The 3D structure representation of a molecule.
4.05.5.2 0D Descriptors or Count Descriptors
All the molecular descriptors for which no information about molecular structure and atom connectivity is
needed belong to the class of 0D descriptors. Atom and bond counts, as well as sums or averages of atomic
properties, are typical of this class of descriptors. These descriptors can always be easily calculated, are naturally
interpreted, do not require optimization of the molecular structure, and are independent of any conformational
problem. They usually show very high degeneracy, that is, they have equal values for several molecules,
such as isomers. Their information content is low, but nevertheless they can play an important role in modeling
several physicochemical properties or take part in more complex models.
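The degeneracy of count descriptors can be illustrated with a minimal sketch. The function name `count_descriptors` and the small mass table are assumptions introduced here for illustration; the two isomers are described only by their atom lists, with no connectivity.

```python
from collections import Counter

# Approximate standard atomic masses (g/mol) for the elements used below.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "O": 15.999}

def count_descriptors(atoms):
    """Return 0D descriptors computed from the atom list alone (no connectivity)."""
    counts = Counter(atoms)
    return {
        "n_atoms": len(atoms),
        "n_C": counts["C"],
        "n_H": counts["H"],
        "n_O": counts["O"],
        "mol_weight": round(sum(ATOMIC_MASS[a] for a in atoms), 3),
    }

# Ethanol (CH3-CH2-OH) and dimethyl ether (CH3-O-CH3) are isomers: C2H6O.
ethanol = ["C", "C", "O"] + ["H"] * 6
dimethyl_ether = ["C", "O", "C"] + ["H"] * 6

print(count_descriptors(ethanol))
# The two isomers cannot be told apart by these descriptors.
print(count_descriptors(ethanol) == count_descriptors(dimethyl_ether))
```

Because the atom order and the bonds are ignored, any pair of isomers yields identical values, which is the degeneracy described above.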
4.05.5.3 1D Descriptors or Fingerprints
All the molecular descriptors that can be calculated from substructural information about the molecule belong
to the 1D descriptors. Counts of functional groups and substructure fragments, as well as atom-centered
descriptors, are the best-known 1D descriptors. These descriptors are often presented as fingerprints, that is,
binary vectors where 1 indicates the presence of a defined substructure and 0 its absence. A relevant advantage
of describing molecules by fingerprints is the possibility of performing quick calculations for molecule similarity/
diversity problems.
Like 0D descriptors, these descriptors can usually be easily calculated, are naturally interpreted, do not
require optimization of the molecular structure, and are independent of any conformational problem. They
usually show medium-to-high degeneracy and are often very useful in modeling both physicochemical and
biological properties.
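As an illustration of how quickly similarity can be evaluated on fingerprints, the sketch below computes the Tanimoto (Jaccard) coefficient between two binary vectors. The choice of the Tanimoto coefficient and the 8-bit fingerprints are assumptions made for this example; the text does not prescribe a specific similarity measure.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Bits set in both fingerprints, and bits set in at least one of them.
    both = sum(1 for x, y in zip(fp_a, fp_b) if x and y)
    either = sum(1 for x, y in zip(fp_a, fp_b) if x or y)
    # Two all-zero fingerprints are treated as identical by convention.
    return both / either if either else 1.0

# Hypothetical fingerprints: each position flags one predefined substructure.
fp_1 = [1, 1, 0, 0, 1, 0, 0, 1]
fp_2 = [1, 0, 0, 0, 1, 0, 1, 1]

print(tanimoto(fp_1, fp_2))  # 3 shared bits out of 5 set bits -> 0.6
```

Only counting operations on bits are needed, which is why fingerprints are convenient for screening large sets of molecules.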
4.05.5.4 2D Descriptors or Topological Descriptors
TIs are molecular descriptors based on a graph representation of the molecule and represent graph-theoretical
properties that are preserved by isomorphism, that is, properties with identical values for isomorphic graphs. A
graph invariant may be a characteristic polynomial, a sequence of numbers, or a single numerical index
obtained by the application of algebraic operators to matrices representing molecular graphs and whose values
are independent of vertex numbering or labeling.
TIs are usually derived from an H-depleted molecular graph. They can be sensitive to one or more structural
features of the molecule such as size, shape, symmetry, branching, and cyclicity and can also encode chemical
information concerning atom type and bond multiplicity. In fact, TIs are usually divided into two categories:
topostructural and topochemical indices [125]. Topostructural indices encode only information on the adjacency
and distance of atoms in the molecular structure; topochemical indices quantify information on topology but
also specific chemical properties of atoms such as their chemical identity and hybridization state.
Figure 10 A lattice of grid points with an embedded molecule.
Topological information indices are graph invariants, based on information theory and calculated as
information content of specified equivalence relationships on the molecular graph.
In general, TIs do not uniquely characterize molecular topology; different structures may share some of the
same TI values. A consequence of this nonuniqueness is that TIs do not, in general, allow the molecule to be
reconstructed.
There are several ways to obtain topological descriptors. Simple TIs consist in the counting of some specific
graph elements; examples are the Hosoya Z index [126], path counts [127], walk counts, self-returning walk
counts [28], Kier shape descriptors [128], and path/walk shape indices [129]. However, the most common TIs are
derived by applying some algebraic operators (e.g., the Wiener operator) to a matrix representation of the
molecular structure, such as adjacency and distance matrices. Among them are the Wiener index [130], spectral
indices [131], and Harary indices [132].
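As a minimal sketch of a TI obtained from a matrix representation, the Wiener index can be computed as the sum of the topological distances over all vertex pairs, that is, half the sum of the entries of the distance matrix. The function names and the vertex numbering of the two H-depleted graphs are assumptions introduced for illustration, and the graph is assumed to be connected.

```python
from collections import deque

def distance_matrix(n_atoms, bonds):
    """Topological distance matrix of a connected H-depleted graph (BFS from each vertex)."""
    neighbors = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    dist = [[None] * n_atoms for _ in range(n_atoms)]
    for start in range(n_atoms):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in neighbors[v]:
                if dist[start][w] is None:
                    dist[start][w] = dist[start][v] + 1
                    queue.append(w)
    return dist

def wiener_index(n_atoms, bonds):
    """Sum of the distances over all pairs i < j (upper triangle of the distance matrix)."""
    d = distance_matrix(n_atoms, bonds)
    return sum(d[i][j] for i in range(n_atoms) for j in range(i + 1, n_atoms))

# H-depleted graphs of two C4H10 isomers (carbon atoms only).
n_butane = [(0, 1), (1, 2), (2, 3)]
isobutane = [(0, 1), (0, 2), (0, 3)]

print(wiener_index(4, n_butane))   # 10
print(wiener_index(4, isobutane))  # 9
```

Unlike the count descriptors, the index separates the two isomers, the more branched structure giving the lower value.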
Molecular matrices are the most common mathematical tool to encode structural information of molecules.
Very popular molecular matrices are the graph-theoretical matrices, a huge number of which were proposed in
the last decades in order to derive TIs and describe molecules from a topological point of view.
Graph-theoretical matrices are matrices derived from a molecular graph G (often from an H-depleted molecular graph).
A comprehensive collection of graph-theoretical matrices is reported by Janezic et al. [133] Vertex matrices are
undoubtedly the graph-theoretical matrices most frequently used for characterizing a molecular graph. The
matrix entries encode different information about pairs of vertices such as their connectivities, topological
distances, and sums of the weights of the atoms along the connecting paths; the diagonal entries can encode
chemical information about the vertices. A huge number of TIs have been derived from vertex matrices.
Other topological molecular descriptors can be obtained by using suitable functions applied to local vertex
invariants (LOVIs), these being numerical representations of the atoms derived from molecular graphs. The
most common functions are atom and/or bond additive, resulting in descriptors that correlate well with
physicochemical properties that are atom and/or bond additive themselves. For example, Zagreb indices [31],
the Randic connectivity index [134], related higher-order connectivity indices [135], and Balaban distance
connectivity indices [136] are derived according to this approach.
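The sketch below shows how atom- and bond-additive indices can be built from the simplest LOVI, the vertex degree. It uses the usual definitions of the first and second Zagreb indices (sum of squared degrees over atoms; sum of degree products over bonds) and of the Randic connectivity index (sum over bonds of the inverse square root of the degree product). The function names and the example graphs are assumptions introduced here.

```python
import math

def vertex_degrees(n_atoms, bonds):
    """Vertex degree of each atom in the H-depleted graph (a simple LOVI)."""
    deg = [0] * n_atoms
    for i, j in bonds:
        deg[i] += 1
        deg[j] += 1
    return deg

def zagreb_indices(n_atoms, bonds):
    """First (atom-additive) and second (bond-additive) Zagreb indices."""
    deg = vertex_degrees(n_atoms, bonds)
    m1 = sum(d * d for d in deg)
    m2 = sum(deg[i] * deg[j] for i, j in bonds)
    return m1, m2

def randic_index(n_atoms, bonds):
    """Randic connectivity index: bond-additive sum of (deg_i * deg_j) ** -0.5."""
    deg = vertex_degrees(n_atoms, bonds)
    return sum(1.0 / math.sqrt(deg[i] * deg[j]) for i, j in bonds)

n_butane = [(0, 1), (1, 2), (2, 3)]
isobutane = [(0, 1), (0, 2), (0, 3)]

print(zagreb_indices(4, n_butane), round(randic_index(4, n_butane), 4))    # (10, 8) 1.9142
print(zagreb_indices(4, isobutane), round(randic_index(4, isobutane), 4))  # (12, 9) 1.7321
```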
Particular TIs are derived from weighted molecular graphs where vertices and/or edges are weighted by
quantities representing some 3D features of the molecule, like those obtained by computational chemistry. The
graph invariants obtained in this way encode information on both molecular topology and molecular geometry.
BCUT descriptors [137] are an example of these topological descriptors. Graph invariants have been successfully
applied in characterizing the structural similarity/diversity of molecules and in QSAR/QSPR modeling.
4.05.5.5 3D Descriptors or Geometrical Descriptors
Another class of molecular descriptors, called geometrical or 3D descriptors, is derived from a geometrical
representation of the molecule, that is, from the x, y, z Cartesian coordinates of the molecule atoms. Some of the
best-known geometrical descriptors are briefly presented here.
WHIM [56] descriptors are molecular descriptors based on statistical indices calculated on the projections of
the atoms along principal axes of the molecule. They are built in such a way as to capture relevant molecular
3D information regarding molecular size, shape, symmetry, and atom distribution with respect to invariant
reference frames. The algorithm consists in performing a PCA on the centered Cartesian coordinates of a
molecule by using a weighted covariance matrix obtained from different weighting schemes for the atoms. For
each weighting scheme, a set of statistical indices is calculated on the atoms projected onto each principal
component, that is, the scores.
Gravitational indices [57] are geometrical descriptors reflecting the mass distribution in a molecule, defined as
$$ G_1 = \sum_{i=1}^{A-1} \sum_{j=i+1}^{A} \frac{m_i m_j}{r_{ij}^2} \tag{29} $$
$$ G_2 = \sum_{b=1}^{B} \left( \frac{m_i m_j}{r_{ij}^2} \right)_b \tag{30} $$
where $m_i$ and $m_j$ are the atomic masses of the considered atoms, $r_{ij}$ the corresponding interatomic distances,
and A and B the number of atoms and bonds of the molecule, respectively. The $G_1$ index takes into account all atom
pairs in the molecule, whereas the $G_2$ index is restricted to pairs of bonded atoms. These indices are related to
the bulk cohesiveness of the molecules accounting, simultaneously, for both atomic masses (volumes) and their
distribution within the molecular space.
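Equations (29) and (30) translate directly into a double loop over atom pairs and a loop over bonds. The following is a minimal sketch; the function name and the toy linear triatomic molecule (masses and coordinates chosen only to give round numbers) are assumptions, not data from the text.

```python
from itertools import combinations

def gravitational_indices(masses, coords, bonds):
    """Return (G1, G2): sums of m_i * m_j / r_ij**2 over all atom pairs and over bonds."""
    def pair_term(i, j):
        r_squared = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
        return masses[i] * masses[j] / r_squared

    g1 = sum(pair_term(i, j) for i, j in combinations(range(len(masses)), 2))
    g2 = sum(pair_term(i, j) for i, j in bonds)
    return g1, g2

# Hypothetical linear triatomic molecule with unit bond lengths.
toy_masses = [2.0, 1.0, 2.0]
toy_coords = [(-1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
toy_bonds = [(0, 1), (1, 2)]

print(gravitational_indices(toy_masses, toy_coords, toy_bonds))  # (5.0, 4.0)
```

The difference between the two values (here 1.0) is the contribution of the nonbonded pair, which enters $G_1$ only.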
EVA descriptors [58] were proposed to extract chemical structural information from mid- and near-infrared
spectra. The approach is to use, as a multivariate descriptor, the vibrational frequencies of a molecule, a
fundamental molecular property characterized reliably and easily from the potential energy function. The EVA
descriptor is a function of the eigenvalues obtained from the normal coordinate matrix; it corresponds to the
fundamental vibrational frequencies of the molecule, which can be calculated using standard quantum or
molecular mechanical methods from computational chemistry.
The EEVA descriptors [60] are analogous to the EVA descriptors, but semiempirical molecular orbital energies,
that is, the eigenvalues of the Schrödinger equation, are used instead of the vibrational frequencies of the
molecule.
3D-MoRSE descriptors [59] are based on the idea of obtaining information from the 3D atomic coordinates by
the transform used in electron diffraction studies for preparing theoretical scattering curves. The derived
expression is the following:
$$ I(s) = \sum_{i=1}^{A-1} \sum_{j=i+1}^{A} w_i w_j \frac{\sin(s r_{ij})}{s r_{ij}} \tag{31} $$
where I(s) is the scattered electron intensity, w an atomic property (e.g., the atomic number), $r_{ij}$ the interatomic
distance between the ith and the jth atoms, and A the number of atoms. Radial distribution function (RDF)
descriptors [138] are based on the distance distribution in the geometrical representation of a molecule and
constitute an RDF code that shows certain characteristics in common with the 3D-MoRSE descriptors.
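A minimal sketch of Equation (31) is given below. The function name, the toy coordinates, and the grid of s values are assumptions for illustration; the atomic number is used as the weight w, as suggested in the text.

```python
import math
from itertools import combinations

def morse_intensity(s, weights, coords):
    """Scattered intensity I(s): sum over atom pairs of w_i * w_j * sin(s * r_ij) / (s * r_ij)."""
    total = 0.0
    for i, j in combinations(range(len(weights)), 2):
        x = s * math.dist(coords[i], coords[j])
        # sin(x) / x tends to 1 as x tends to 0, which covers the case s = 0.
        total += weights[i] * weights[j] * (math.sin(x) / x if x else 1.0)
    return total

# Hypothetical linear O-C-O arrangement with unit bond lengths; weights are atomic numbers.
toy_weights = [8, 6, 8]
toy_coords = [(-1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

# A descriptor vector is obtained by sampling I(s) at several values of s.
descriptor = [morse_intensity(s, toy_weights, toy_coords) for s in (0.0, 1.0, 2.0, 3.0)]
print(descriptor[0])  # at s = 0 every pair contributes w_i * w_j: 48 + 64 + 48 = 160.0
```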
The GETAWAY [61] descriptors are derived from the molecular influence matrix (H), which is a representation
of the molecular structure, defined as
$$ H = M \left( M^T M \right)^{-1} M^T \tag{32} $$
where M is the molecular matrix consisting of the centered Cartesian coordinates x, y, z of the molecule atoms
in a chosen conformation. Atomic coordinates are assumed to be calculated with respect to the geometrical
center of the molecule in order to obtain translational invariance. The molecular influence matrix is a
symmetric A × A matrix, where A represents the number of atoms, and shows rotational invariance with respect
to the molecule coordinates, thus being independent of molecule alignment rules. The diagonal elements $h_{ii}$
of this matrix range from 0 to 1 and encode atomic information related to the influence of each molecule atom
in determining the whole shape of the molecule; in effect, mantle atoms always have higher $h_{ii}$ values than
atoms near the molecule center. GETAWAY descriptors are obtained by using double-weighted autocorrelation
functions, where one weighting scheme is the leverage and the other an atomic property (e.g., atomic mass).
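The leverages $h_{ii}$ of Equation (32) can be sketched in plain Python as follows. The function name and the idealized methane-like geometry are assumptions; the sketch covers only non-planar molecules, for which the 3 × 3 matrix $M^T M$ can be inverted directly.

```python
def leverages(coords):
    """Diagonal elements h_ii of H = M (M^T M)^-1 M^T, with M the centered coordinates."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    m = [[c[k] - center[k] for k in range(3)] for c in coords]
    # 3 x 3 matrix S = M^T M
    s = [[sum(row[a] * row[b] for row in m) for b in range(3)] for a in range(3)]
    det = (s[0][0] * (s[1][1] * s[2][2] - s[1][2] * s[2][1])
           - s[0][1] * (s[1][0] * s[2][2] - s[1][2] * s[2][0])
           + s[0][2] * (s[1][0] * s[2][1] - s[1][1] * s[2][0]))
    if abs(det) < 1e-12:
        raise ValueError("M^T M is singular (planar or linear geometry)")
    # Inverse by the adjugate formula: inv[a][b] = cofactor(b, a) / det
    inv = [[0.0] * 3 for _ in range(3)]
    for a in range(3):
        for b in range(3):
            rows = [i for i in range(3) if i != b]
            cols = [j for j in range(3) if j != a]
            minor = (s[rows[0]][cols[0]] * s[rows[1]][cols[1]]
                     - s[rows[0]][cols[1]] * s[rows[1]][cols[0]])
            inv[a][b] = ((-1) ** (a + b)) * minor / det
    # h_ii = m_i^T (M^T M)^-1 m_i for each atom i
    return [sum(row[a] * inv[a][b] * row[b] for a in range(3) for b in range(3))
            for row in m]

# Idealized methane-like geometry: central atom at the origin, four atoms on a tetrahedron.
methane_like = [(0, 0, 0), (1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
print(leverages(methane_like))  # central atom 0.0, outer atoms 0.75 each
```

In this example the central atom has zero leverage and the four outer (mantle) atoms share the highest value, and the leverages sum to 3, the rank of M.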
As a geometrical representation involves the knowledge of the relative positions of the atoms in 3D space,
that is, the (x, y, z) atomic coordinates of the molecule atoms, geometrical descriptors usually provide more
information and discrimination power than topological descriptors, also for similar molecular structures and
molecule conformations. Despite their high information content, geometrical descriptors usually show some
drawbacks. They require geometry optimization, which adds computational overhead. Moreover, for
flexible molecules, several molecule conformations are available: on one hand, new information is available and
can be exploited, but, on the other hand, the problem complexity can significantly increase.
For these reasons, topological descriptors, fingerprints based on fragment counts, and other simple descriptors
are usually preferred for the screening of large databases of molecules. On the contrary, searching for
relationships between molecular structures and complex properties, such as biological activities, can often
be performed efficiently by using geometrical descriptors, exploiting their large information content.
Moreover, it is important to remember that the biologically active conformation of the studied chemicals is
seldom known. Some authors overcome this problem by using a multiconformation dynamic approach [123,124].
4.05.5.6 4D Descriptors or Grid-Based Descriptors
GRID [62] and CoMFA [63] approaches were the first methods based not only on the molecular structure but also on
the calculation of the interaction energy between molecule and probe. The focus of these approaches is to
identify and characterize quantitatively the interactions between the molecule and the receptor's active site.
They place the molecules in a 3D lattice constituted by several thousands of evenly spaced grid points and
use a probe (steric, electrostatic, hydrophilic, etc.) to map the surface of the molecule on the basis of the
molecule interaction with the probe.
QSAR models are obtained by the application of PLS regression to the interaction field matrix. It should be
noted that the use of the grid points as molecular descriptors requires the careful step of aligning the considered
molecules in such a way that each of the thousands of grid points represents, for all the molecules, the same kind
of information and not spurious information because of the lack of invariance in the rotation of the molecules in
the grid.
Besides the two most popular methods GRID and CoMFA, the other known methods based on this approach
are CoMSIA [64], Compass [65], G-WHIM descriptors [66], Voronoi Field Analysis [67], SOMFA [139], VolSurf
descriptors [68], and GRIND [69]. Although these descriptors are often called 3D descriptors, they can be more properly
called 4D descriptors (or grid-based descriptors) because geometrical information is supplemented by another source of
information given by the interaction energy with a specific probe at each point of a 3D grid embedding the
molecule. Therefore, the molecular descriptors are the MIFs generated by probes. These scalar fields can be
efficiently visualized and used to think visually about new drug candidates, thus proving very helpful in the
drug discovery process [140,141].
An advantage of these approaches is that final results show where and how to modify the compounds to reach
the desirable values of the studied molecular property. On the contrary, a drawback is the need for molecular
alignment in order to achieve molecular comparability and the selection of the most appropriate conformation.
The alignment determines to what extent the descriptors differ from one molecule to the next.
Consequently, it substantially influences the results of the evaluation. Hence, significant and relevant results
can only be expected if the alignment was carried out properly and unambiguously. Often, the need for an
alignment limits the application of certain descriptors to homogeneous data sets, and even then the alignment
is not always easily performed. As a consequence, different research groups started to develop
alignment-independent molecular descriptors. The first set of descriptors based on scalar fields but alignment
independent were G-WHIM descriptors [66], based on the theoretical principles of the WHIM descriptors [56] but
applied to the MIFs. VolSurf [68] and GRIND [69] descriptors are also independent of any previous alignment of the
molecules.
4.05.6 Molecular Descriptor Selection
In the last few years, the scientific community has paid great attention to techniques devoted to
variable selection, namely molecular descriptor selection in QSARs. As there are thousands of
descriptors available for describing a molecule and often there is no a priori knowledge about which molecular
features are more responsible for a specific property, subsets of the most appropriate descriptors are searched
for by using different strategies.
It is now indisputable that model reliability is affected not only by the presence of noise and of correlated or
redundant descriptors, but also by the presence of irrelevant descriptors. Therefore, variable selection techniques
are largely used to remedy this situation and improve the accuracy and the prediction power of
classification or regression models.
The exhaustive search, sometimes called all possible models (APMs), can be applied only in the simplest
cases, the search space becoming impractical when there is a large number of molecular descriptors: in effect, given p
candidate variables, the number of APMs containing a number k of variables between 1 and V (V < p) is
$$ \text{Total number of models} = \sum_{k=1}^{V} \frac{p!}{k!\,(p-k)!} < 2^p \tag{33} $$
The total maximum number of models obtained from p candidate variables is exactly $2^p - 1$ (the model with
zero variables is not counted, it being not very useful!). For example, given 50 candidate variables, the total
number of possible models containing from 1 to 5 variables is 2 369 935. Two main approaches can be used for
extracting nonredundant but relevant variables from the pool of available variables: variable reduction and
variable subset selection (VSS).
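The count in Equation (33) is easy to check numerically. The short sketch below (the function name is an assumption) reproduces the figure quoted for 50 candidate variables and the upper limit of $2^p - 1$.

```python
from math import comb

def number_of_models(p, v):
    """Number of models with 1 to v variables chosen among p candidates (Equation 33)."""
    return sum(comb(p, k) for k in range(1, v + 1))

print(number_of_models(50, 5))                  # 2369935
print(number_of_models(50, 50) == 2 ** 50 - 1)  # True: all nonempty subsets
```

Already with 50 candidates the complete search space exceeds $10^{15}$ models, which is why heuristic search strategies are needed.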
4.05.6.1 Variable Reduction
When the molecular descriptors to be used in a model are chosen on the basis of general principles and not
accounting for a specific goal (i.e., some experimental property to model), the term variable reduction can be
more properly used than variable selection. By variable reduction techniques, molecular descriptors are chosen
by comparison among the descriptors themselves regardless of the specific molecular property that needs to be
modeled.
For instance, descriptors can be selected on the basis of their information content (e.g., the Shannon entropy
or the most commonly used standardized entropy): descriptors with high information content are more
effective in discriminating different molecules and, thus, are expected to be more effective in modeling any
property of molecules. Moreover, in order to avoid redundant information, a check of descriptor pairwise
correlations is advisable. When two descriptors have an absolute correlation higher than a
predefined threshold (often selected in the range 0.90–0.99), one of them has to be discarded. But which of
the two should be deleted from the data set? A good solution may be to delete the descriptor showing
the highest average correlation with the other descriptors in the data set or the lowest variance or entropy.
Variable reduction can also be performed by multivariate techniques such as PCA-based feature selection,
iteratively deleting variables with the largest loadings in the last components or retaining the variables with the
largest loadings in the first components [142,143].
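The pairwise-correlation check described above can be sketched as follows. This is one possible reading of the rule, assuming that the most correlated pair is treated first and that, within the pair, the descriptor with the highest average absolute correlation with all the others is deleted. The function names, the threshold of 0.95, and the descriptor values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def reduce_by_correlation(descriptors, threshold=0.95):
    """descriptors: dict mapping a name to its values over the molecules.
    Returns the names kept after removing one member of each highly correlated pair."""
    kept = dict(descriptors)
    while True:
        names = list(kept)
        corr = {(a, b): abs(pearson(kept[a], kept[b]))
                for i, a in enumerate(names) for b in names[i + 1:]}
        pair, value = max(corr.items(), key=lambda item: item[1], default=(None, 0.0))
        if pair is None or value <= threshold:
            return names

        def mean_corr(name):
            # Average absolute correlation of `name` with every other descriptor.
            others = [n for n in names if n != name]
            return sum(corr.get((name, o), corr.get((o, name))) for o in others) / len(others)

        # Delete the pair member that is, on average, more correlated with the rest.
        del kept[max(pair, key=mean_corr)]

# Hypothetical data: d1 and d2 are nearly collinear, d3 carries different information.
toy_descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "d3": [5.0, 3.0, 4.0, 1.0, 2.0],
}
print(reduce_by_correlation(toy_descriptors))  # ['d2', 'd3']
```

No response variable enters the procedure, which is what distinguishes variable reduction from variable selection.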
Moreover, all the clustering methods applied on the transposed matrix of the original data, where descriptors
become rows and molecules columns, can be used for variable reduction purposes. Representative descriptors
are chosen (one or more) from each cluster. Together with the classical clustering methods (k-means, Jarvis-
Patrick method, hierarchical clustering, etc.), the SOMs, such as the Kohonen maps, are nowadays very popular
because of their efficiency and simple use. Optimal design techniques, such as D-optimal design, can also be
used for the same purposes.
Other variable reduction techniques are based on the ranking of the molecular descriptors according to their
global correlation with the other descriptors in the data set. In this regard, the method called K-correlation
analysis exploits the K-multivariate correlation index for the iterative ranking from the most correlated to the
least correlated descriptors [144,145]. In this approach, the K-correlation of the data set, constituted by $p - 1$
variables after deleting one variable, is calculated and this value is attributed to the excluded variable. The
procedure is repeated excluding in turn one variable at a time. Then, the variable showing the lowest K-value is
definitively eliminated. The whole procedure is repeated on the remaining variables until only two variables
remain. In other words, at each step the variable that shows the highest global correlation with all the other
variables is excluded from the data set.
4.05.6.2 Variable Subset Selection
VSS techniques, unlike variable reduction techniques, take into account the specific property to be modeled.
For instance, in regression analysis, these techniques aim at finding the subset of molecular descriptors that lead
to the best predictive model for the studied property.
A number of different variable selection techniques are nowadays available: besides the classical SWR,
proposed by Efroymson [146] in the late 1960s and based on alternating forward selections and backward
eliminations, other more powerful techniques were devised and are largely used for variable selection purposes.
Owing to the huge number of possible combinations of descriptors, this high-complexity problem is often
solved by machine learning techniques: Genetic Algorithms (GAs) [147,148], Simulated Annealing (SA) [149], Tabu
Search (TS) [150], and Evolutionary Programming (EP) [151] are the most common in QSAR research. More recent
techniques, and thus less known, are Artificial Ants [152], Particle Swarm [153], and an approach based on Projection
Pursuit using robust estimators [154].
Several modifications of the original PLS regression method were also proposed with the aim of performing
variable selection and, among them, Iterative Variable Selection for PLS (IVS-PLS) [155,156] and Uninformative
Variable Elimination by PLS (UVE-PLS) [157] are the most popular.
The general approach to the VSS is shown in Figure 11. The first step (A) is the definition of the algorithm
performing the selection of one or more variables within the whole set of candidate variables. This step can be
performed by selecting the variables by a random strategy or by using a genetic strategy (based on repeated
reproduction and mutation steps), or other approaches. Then, from each subset of variables, a model is
calculated.
The second step (B) is the evaluation of the quality associated to each model by using proper optimization
functions (often called fitness functions). In this phase, both the method for estimating the models and the
fitness function to be optimized have been previously defined.
The most popular regression methods used in the model estimation are OLS (or MLR), PLS, BP-ANN, and the
k-NN estimator.
In regression studies, the most popular fitness function is the prediction ability ($Q^2$) based on leave-one-out
(LOO) or leave-more-out (LMO) validation, the LOO procedure being the most common during the model selection
phase.
However, the acceptability of a final regression model (step C) should not be evaluated simply by looking at its
prediction ability but also by considering additional rules. For instance, models whose differences between $R^2$ and
$Q^2$ (obtained by the LOO procedure) are too large [158] should be rejected because a significant decrease in the
prediction ability can be expected in their practical use on new chemicals. Therefore, in order to prevent the
acceptance of models that are not really predictive and/or are chance correlated, severe optimization functions
need to be used, such as the AIC index [159], the LOF function [160], the FIT function [161], and the RQK
functions [162], these last including more than one rule for the model acceptability.
The iterative step (D) depends on the chosen variable selection technique. During the iterative procedure,
the stop conditions are checked, and the accepted models are properly managed.
Figure 11 General scheme of the variable selection approach. [Flowchart panels: Set of candidate descriptors;
Subset of selected descriptors; Estimate of the model optimization function; Evaluation of the model acceptability;
Check of the predictive ability of the selected models; steps labeled A to E, governed by the variable subset
selection technique.]
The simplest VSS techniques preserve only the best model at each step, providing a final unique model at the
end of the optimization procedure. Other strategies (typically those based on GAs) provide a
population of accepted models or more than one population of models.
The final selected model(s) is processed further (E) to check its effective predictive ability and the possible
presence of chance correlation. To this end, strong validation procedures are applied, such as bootstrap or the use of
an external data set; chance correlation can be evaluated by the Y-randomization test (see 4.05.7.3).
An example of a variable selection algorithm is given in Figure 12, as proposed by Zheng and Tropsha [164].
The whole optimization procedure is managed by the SA algorithm. The crucial step B is performed by the
LOO validation technique and the fitness function is the $Q^2$ evaluated by the k-NN algorithm. This procedure
was applied for variable selection, but the best final model, applicable for reliable prediction, was selected after
validation on an external data set [163,164].
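The fitness evaluation of step B can be sketched for the k-NN case as follows. This is not the published algorithm: it covers only the LOO estimate of $Q^2$ for a given descriptor subset, assumes an unweighted average of the k nearest neighbors, Euclidean distances, and the usual definition $Q^2 = 1 - \mathrm{PRESS}/\mathrm{TSS}$ with TSS computed around the overall mean response. The function names and the data are hypothetical, and the simulated annealing loop that would propose the subsets is omitted.

```python
import math

def knn_loo_q2(X, y, k=2):
    """Leave-one-out Q2 of a k-NN estimator: each compound is predicted from the others."""
    n = len(y)
    y_mean = sum(y) / n
    press = 0.0
    for i in range(n):
        # Distances from the excluded compound i to all remaining compounds.
        others = sorted((math.dist(X[i], X[j]), j) for j in range(n) if j != i)
        y_pred = sum(y[j] for _, j in others[:k]) / k
        press += (y[i] - y_pred) ** 2
    tss = sum((v - y_mean) ** 2 for v in y)
    return 1.0 - press / tss

def select_columns(X, cols):
    """Restrict the descriptor matrix to a candidate subset of descriptor columns."""
    return [[row[c] for c in cols] for row in X]

# Hypothetical data: descriptor 0 tracks the response, descriptor 1 is unrelated to it.
X_toy = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0], [5.0, 6.0], [6.0, 3.0]]
y_toy = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1]

q2_informative = knn_loo_q2(select_columns(X_toy, [0]), y_toy)
q2_noise = knn_loo_q2(select_columns(X_toy, [1]), y_toy)
print(round(q2_informative, 3), q2_noise < 0)
```

A search strategy such as SA would call a function of this kind on each proposed subset and keep the subsets with the highest fitness; a negative value, as obtained for the unrelated descriptor, means that the model predicts worse than the mean response.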
Figure 12 Variable subset selection based on simulated annealing and k-NN method. [Flowchart panels: Subset of
randomly selected descriptors; $Q^2$ estimate of the model; Select the best model; Leave-one-out compound;
Prediction of the response of the excluded compound by the average k-NN estimate; loop governed by simulated
annealing.]
4.05.6.3 Consensus Modeling
Owing to the large availability of different models predicting the same molecular property, such as those
models selected by GAs, the consensus modeling strategy [165–167] can be used in order to produce more reliable
estimates of the studied property. This strategy can be applied for both regression and classification purposes.
Consensus analysis consists in selecting not just one model, but more than one. Predictions are performed
by using the average response obtained from all the selected models or, better, the weighted average
response, considering as the statistical weight the leverage $h_k$ of the object from each kth model, as in Equation (34):
$$ \hat{y} = \frac{\sum_{k=1}^{M} \hat{y}_k / h_k}{\sum_{k=1}^{M} 1 / h_k} \tag{34} $$
$$ \hat{y} = \frac{\sum_{k=1}^{M} w_k \left( \hat{y}_k / h_k \right)}{\sum_{k=1}^{M} 1 / h_k} \tag{35} $$
where M is the number of selected models and $\hat{y}_k$ is the response estimated by the kth model [167]. The leverage is
a measure of the distance of the object from the model, that is, small leverages correspond to objects well
represented by the model, whereas high leverages represent objects far from the model, thus the response likely
being extrapolated and less reliable. In Equation (35), the weight $w_k$ takes into account the quality (e.g., the
$Q^2_{LOO}$) of each model and is defined as
$$ w_k = \frac{Q_k^2}{\sum_{k=1}^{M} Q_k^2} \tag{36} $$
Consensus modeling can also be performed by evaluating, for each sample, the standard deviation of the
responses predicted by the selected models, thus obtaining a measure of the convergence of all the selected
models toward a unique response.
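A minimal sketch of these consensus calculations is given below for a single object predicted by M = 3 models. It implements the leverage-weighted average of Equation (34), the quality weights of Equation (36), and the standard deviation of the predictions as a convergence measure (the sample standard deviation is assumed). The function names and all numerical values are hypothetical.

```python
import statistics

def consensus_prediction(y_hat, leverage):
    """Equation (34): average of the model responses weighted by 1 / h_k."""
    inv_h = [1.0 / h for h in leverage]
    return sum(y * w for y, w in zip(y_hat, inv_h)) / sum(inv_h)

def quality_weights(q2):
    """Equation (36): each model's Q2 divided by the sum of Q2 over the models."""
    total = sum(q2)
    return [q / total for q in q2]

def prediction_spread(y_hat):
    """Standard deviation of the predicted responses (convergence of the models)."""
    return statistics.stdev(y_hat)

# Hypothetical responses of three models for one object, with its leverage in each model.
y_hat_toy = [2.0, 3.0, 4.0]
h_toy = [0.1, 0.2, 0.4]
q2_toy = [0.8, 0.7, 0.5]

print(round(consensus_prediction(y_hat_toy, h_toy), 4))  # 2.5714, pulled toward the low-leverage model
print(quality_weights(q2_toy))
print(prediction_spread(y_hat_toy))                      # 1.0
```

The unweighted mean of the three responses would be 3.0; the weighted estimate is closer to the response of the model in which the object has the smallest leverage, that is, the model that represents it best.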
Once a set of acceptable models has been obtained, the models to be used for consensus analysis can be chosen
simply taking into account the variables in each model, possibly preferring models with simple and interpretable
variables. In this stage, attention has to be paid to models with variables that are different but highly correlated among
themselves, because the average prediction from these models will be biased toward a reduced source of information.
In order to avoid the selection of models that are only seemingly diverse, because they contain different
descriptors that are nevertheless closely correlated among themselves, a measure of distance between two models can be
used [167,168]. This distance, called model distance, allows the selection of models that are really diverse and, then,
a consensus analysis that takes into account different molecular characteristics. The model distance takes
into account the correlation of variables within and between models and allows the finding of clusters of similar
models, catching the most diverse models in such a way as to preserve maximum information and diversity.
Comparing two models means comparing two p-dimensional binary vectors where each position corresponds
to a variable. The most common way to represent the relationships between two binary vectors, represented
here by models A and B, is a two-way table as shown in Table 5, where a is the number of cases with 1 in the same
position in both vectors, d the number of cases with 0 in the same position in both vectors, b the number of cases
such that for a given position there is 1 in vector A and 0 in vector B, and c the number of cases such that for a given
position there is 1 in vector B and 0 in vector A. Therefore, b and c represent the number of variables not shared
by the two models: b is the number of variables in model A but not in model B, and c the number of variables in
model B but not in model A. The degree of similarity between the two models is in some way related to a and d,
whereas their degree of diversity is related to b and c.
The most common distance measure for two binary vectors $I_A$ and $I_B$, which represent two models A and B, is
the Hamming distance $d_H$ defined as
$$ d_H = b + c \tag{37} $$
where b and c are the numbers defined earlier. It has been demonstrated that the Hamming distance usually
overestimates the distance between two models, neglecting the variable correlations.
In order to measure the distance between two binary vectors $I_A$ and $I_B$ also accounting for variable
correlation, the model distance can be calculated as follows.
As the first step, all the pairs of variables of a model having a correlation equal to 1 have to be identified and
excluded from further calculations. Note that, if the models to be analyzed have been searched for by any
variable selection procedure based on least squares regression, the case of pairs of variables in a model with a
correlation equal to 1 is not possible. In any case all these redundant variables should definitely be excluded
from the model, together with the common variables of the two models, which are deleted for practical reasons.
At this point, the number of diverse variables in the two models is calculated, this number being $b' + c'$,
resembling that used for the Hamming distance even if the symbols $b'$ and $c'$ replace b and c because, in this case,
a preliminary variable reduction has been made.
Table 5 Two-way table collecting variable frequencies between two binary vectors, represented by models A and B
                 Model B = 1    Model B = 0
Model A = 1           a              b
Model A = 0           c              d
To better explain this stage, let us look at an example. Suppose a set of 10 ordered variables is given, let
model A ($I_A$) be constituted by six variables and model B ($I_B$) by four variables, with two common variables
($x_3$ and $x_9$); then their binary vector representations are
$I_A = [0\ 0\ 1\ 1\ 1\ 1\ 1\ 0\ 1\ 0]$    $I_B = [1\ 0\ 1\ 0\ 0\ 0\ 0\ 0\ 1\ 1]$
and the corresponding phenotypic representations:
A: $x_3, x_4, x_5, x_6, x_7, x_9$ and B: $x_1, x_3, x_9, x_{10}$
Now suppose that the variables $x_4$ and $x_5$ of model A have a correlation equal to 1 and the same holds for
variables $x_9$ and $x_{10}$ of model B. Therefore, in both models one of the two variables, either $x_4$ or $x_5$ in model
A and either $x_9$ or $x_{10}$ in model B, has to be deleted. Moreover, the common variables also have to be deleted,
namely $x_3$ and $x_9$ (or $x_{10}$, which is the same as $x_9$).
Then, the reduced models will be composed of the following variables:
A: $x_5, x_6, x_7$ and B: $x_1$
and their binary vectors will become
$I_A = [0\ 0\ 0\ 0\ 1\ 1\ 1\ 0\ 0\ 0]$    $I_B = [1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0]$
It results that $b' = 3$ and $c' = 1$. For these reduced models the Hamming distance is equal to 4, whereas for the
original models it would be 6.
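The two Hamming distances quoted in this example can be verified with a short sketch. The function names are assumptions; the binary vectors are those of the example (original and reduced models).

```python
def two_way_counts(model_a, model_b):
    """Counts a, b, c, d of the two-way table for two binary model vectors."""
    pairs = list(zip(model_a, model_b))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)  # variable in both models
    b = sum(1 for x, y in pairs if x == 1 and y == 0)  # only in model A
    c = sum(1 for x, y in pairs if x == 0 and y == 1)  # only in model B
    d = sum(1 for x, y in pairs if x == 0 and y == 0)  # in neither model
    return a, b, c, d

def hamming_distance(model_a, model_b):
    """Hamming distance d_H = b + c."""
    _, b, c, _ = two_way_counts(model_a, model_b)
    return b + c

original_a = [0, 0, 1, 1, 1, 1, 1, 0, 1, 0]
original_b = [1, 0, 1, 0, 0, 0, 0, 0, 1, 1]
reduced_a = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
reduced_b = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(two_way_counts(original_a, original_b))   # (2, 4, 2, 2)
print(hamming_distance(original_a, original_b)) # 6
print(hamming_distance(reduced_a, reduced_b))   # 4
```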
The second step of the procedure deals with the evaluation of the correlation among all the variables of the
two reduced models. It involves the calculation of the cross-correlation matrix $C_{AB}$, which contains
the correlations between all the possible pairs of variables of the two models. This matrix has $b'$ rows, that is,
the number of variables in the reduced model A, and $c'$ columns, that is, the number of variables in the reduced
model B. The counterpart of $C_{AB}$ (size $b' \times c'$) is the cross-correlation matrix $C_{BA}$ (size $c' \times b'$).
The cross-correlation matrix can be transformed into a symmetric matrix as follows:
\[ Q_A = C_{AB}\, C_{BA} \qquad (b' \times b') \tag{38} \]
\[ Q_B = C_{BA}\, C_{AB} \qquad (c' \times c') \tag{39} \]
The nonzero eigenvalues of both matrices $Q_A$ and $Q_B$ coincide, and the sum $r_{AB}$ of the square roots of these
eigenvalues $\lambda_j$ gives the desired information related to the intermodel variable correlation:
\[ r_{AB} = \sum_j \sqrt{\lambda_j} \tag{40} \]
Finally, the model distance $d^2_M$ is derived by modifying the Hamming distance as follows:
\[ d^2_M(A, B) = b' + c' - 2 \sum_j \sqrt{\lambda_j} = b' + c' - 2\, r_{AB} \tag{41} \]
As is easily seen, if no preliminary variable reduction is carried out, that is, $b' = b$ and $c' = c$, and no correlation
exists between the two variable blocks, that is, $r_{AB} = 0$, the model distance coincides with the Hamming distance.
The model distance satisfies the first two main postulates for a distance measure:
1. $d_{ij} = d_{ji}$
2. $d_{ii} = 0$
Moreover, it was observed that the model distance does not always satisfy the triangle inequality
$d_{ij} + d_{jk} \ge d_{ik}$
thus belonging to the class of non-Euclidean distances.
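The second step (Equations (38)-(41)) can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes NumPy, takes as input the data columns of the two already reduced models (common and perfectly correlated variables removed beforehand), and uses Pearson correlations for the cross-correlation matrix. The function names are hypothetical.

```python
import numpy as np


def cross_correlation(XA, XB):
    """Pearson correlations between every column of XA and every column of XB."""
    ZA = (XA - XA.mean(axis=0)) / XA.std(axis=0)
    ZB = (XB - XB.mean(axis=0)) / XB.std(axis=0)
    return ZA.T @ ZB / XA.shape[0]


def model_distance_sq(XA, XB):
    """Squared model distance (Eq. 41) and intermodel correlation r_AB (Eq. 40).

    XA (n x b') and XB (n x c') hold the variables of the reduced models A and B.
    """
    C_ab = cross_correlation(XA, XB)       # size b' x c'
    Q_a = C_ab @ C_ab.T                    # Eq. 38, symmetric, size b' x b'
    eigvals = np.linalg.eigvalsh(Q_a)
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative rounding errors
    r_ab = float(np.sqrt(eigvals).sum())   # Eq. 40
    b_prime, c_prime = XA.shape[1], XB.shape[1]
    return b_prime + c_prime - 2.0 * r_ab, r_ab


# Two uncorrelated single-variable models: the distance equals the Hamming distance
xa = np.array([[1.0], [-1.0], [1.0], [-1.0]])
xb = np.array([[1.0], [1.0], [-1.0], [-1.0]])
print(model_distance_sq(xa, xb))
```

When the two variable blocks are uncorrelated, `r_ab` is zero and the result reduces to b' + c', in agreement with the remark that follows Equation (41); when one single-variable model is a linear transform of the other, the distance drops to zero.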
Consensus modeling has been employed in different QSAR studies [169–172], often giving better statistical fits
and predictive abilities than the single models; moreover, consensus analysis has also been shown to
diminish the effect of noisy data [173].
4.05.7 Principles for QSAR Modeling
In recent years, the basic philosophy underlying QSAR modeling has been changing to account for the new requirements
that QSAR models should satisfy in order to be used effectively.
First of all, in order to be reproducible, all models have to be fully described; in other words, the methods
used for their calculation and assessment have to be well defined, as well as the molecular descriptors appearing in
the models, the modeled property, and the chemicals used in the training set. Furthermore, for several years,
QSAR models have been developed not only on congeneric data sets but also on noncongeneric sets of
compounds, because of the need to obtain more general relationships and to exploit the great potential of the large
data sets of compounds now available. Moreover, evaluation of the model AD has been widely recognized as a
safe tool to predict responses for new chemicals while avoiding extrapolation, and validation has by now entered the
common practice of QSAR modeling. In effect, QSAR models are accepted only if validated, that is, some
predictive ability parameter has to be estimated for a reliable use of the model on new compounds.
All the general principles to properly produce valid QSAR models were recently taken into account by the
OECD (http://www.oecd.org/document/23/0,2340,en_2649_201185_33957015_1_1_1_1,00.html) and
formally declared fundamental tools to estimate data on chemicals by QSARs.
The New Chemicals Policy of the European Commission (REACH)
(http://eur-lex.europa.eu/LexUriServ/site/en/oj/2006/l_396/l_39620061230en00010849.pdf) explicitly states that, at the
chemical registration level, the registrant should include information from alternative sources (e.g., from QSARs)
which may assist in identifying the presence or absence of hazardous properties of the substance and which
can, in certain cases, replace the results of animal tests. Obviously, for the purposes of the REACH legislation, it
is essential to use QSAR models that produce reliable estimates, that is, validated QSAR models
(http://ecb.jrc.it/qsar/). Model validation has been the subject of much recent debate in the scientific and regulatory
communities [163,164,174–182]. After the REACH legislation, it was considered important to develop an
internationally recognized set of principles for QSAR validation, to provide regulatory bodies with a scientific
basis for making decisions on the acceptability of QSAR estimates of regulatory endpoints and to promote the
mutual acceptance of QSAR models.
Several principles for assessing the validity of QSARs were proposed in 2004 by the OECD Work
Programme on QSARs and are currently known as the OECD Principles for QSAR model validation and
regulatory purposes (http://www.oecd.org/dataoecd/33/37/37849783.pdf):
To facilitate the consideration of a (Q)SAR model for regulatory purposes, it should be associated with the following
information: 1) a defined endpoint; 2) an unambiguous algorithm; 3) a defined domain of applicability; 4) appropriate
measures of goodness-of-fit, robustness and predictivity; 5) a mechanistic interpretation, if possible.
Some considerations about these basic principles for QSAR modeling are discussed below.
4.05.7.1 Unambiguous Model Algorithm
For a QSAR model to be acceptable in chemical regulations, it must be clearly defined and easily and continuously
applicable, in such a way that the calculations for the prediction of the endpoint can be reproduced by anyone,
including for new chemicals. Thus, the unambiguous algorithm is characterized not only by the mathematical
method of calculation used but also by the specific molecular descriptors required in the model
equation. Therefore, the exact procedure used to calculate the descriptors, including compound pretreatment
(e.g., energy minimization and partial charge calculation), the software employed, and the variable selection
method for QSAR model development should be considered integral parts of the overall definition of an
unambiguous algorithm.
4.05.7.2 Applicability Domain
The concept of AD concerns the predictive use of QSAR/QSPR models and is therefore closely related to the
concept of model validation. In other words, the AD is a concept related to the quality of the QSAR/QSPR
model predictions and to the prevention of the potential misuse of model results. A key component of prediction
quality is to define when a QSAR/QSPR model is suitable to predict a property/activity of a new
compound, that is, to define the model AD [164,174,176–178,180,181].
A model will yield reliable predictions when model assumptions are fulfilled and unreliable predictions when
they are violated. In particular, for QSAR/QSPR models based on statistical mining techniques, the training set
and the model prediction space are the basis for estimating the chemical space where predictions are reliable.
Two basic approaches were proposed to evaluate the AD. The first approach is the analysis
of the training set, which has its grounds in statistics, because interpolated predictions are more
reliable than extrapolated ones. Extrapolation is not a problem in principle, because extrapolated results from
theoretically well-founded models can often be reliable. However, QSAR/QSPR models are usually based on
empirical and limited experimental evidence and/or are only locally valid; therefore, extrapolation always
results in higher uncertainty and usually in unreliable predictions.
Different approaches to estimate interpolation regions in a multivariate space were evaluated by
Jaworska [178,179], based on (1) ranges of the descriptor space; (2) distance-based methods, using Euclidean,
Manhattan, and Mahalanobis distances, the Hotelling $T^2$ method, and leverage values; and (3) probability density
distribution methods based on parametric and nonparametric approaches. Both ranges and distance-based
methods were also evaluated in the principal component space.
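Two of the simplest interpolation checks listed above, descriptor ranges and the Mahalanobis distance to the training set centroid, can be sketched as follows. This is a hedged illustration assuming NumPy; the function names are hypothetical, and the cutoff applied to the distance is left to the user because the text does not prescribe one.

```python
import numpy as np


def in_descriptor_ranges(X_train, x_new):
    """True if every descriptor of x_new lies within the training set range."""
    lower, upper = X_train.min(axis=0), X_train.max(axis=0)
    return bool(np.all((x_new >= lower) & (x_new <= upper)))


def mahalanobis_to_centroid(X_train, x_new):
    """Mahalanobis distance of x_new from the centroid of the training set."""
    centroid = X_train.mean(axis=0)
    cov = np.atleast_2d(np.cov(X_train, rowvar=False))
    diff = x_new - centroid
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


# Toy training set with two descriptors
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
inside = np.array([0.5, 0.5])
outside = np.array([2.0, 0.5])
print(in_descriptor_ranges(X_train, inside), in_descriptor_ranges(X_train, outside))
print(mahalanobis_to_centroid(X_train, outside))
```

The range check treats the domain as a hyper-rectangle and ignores correlations among descriptors, whereas the Mahalanobis distance accounts for them through the covariance matrix.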
One of the common tools used to visualize the AD of a QSAR model is the plot of standardized residuals in
prediction ($r_i$) versus leverage values ($h_i$) for each ith sample. This plot, called the Williams plot, allows an
immediate and simple graphical detection of both the response outliers (i.e., compounds with standardized
residuals in prediction greater than three standard deviation units, $r_i > 3\sigma$) and structurally influential
chemicals in a model ($h_i > h^*$), where $h^*$ is a threshold value, usually 2 or 3 times the average leverage value. In effect,
when the leverage value of a compound is lower than the critical value $h^*$, the probability of accordance
between predicted and actual values is as high as that for the training set chemicals. Conversely, a high leverage
chemical is structurally distant from the other chemicals; thus, it can be considered outside the AD of the model.
Figure 13 shows the Williams plot of a model for polar narcotics in Pimephales promelas as an example [183].
Here, chemical 347 is wrongly predicted ($r_i > 3\sigma$); it is a test chemical completely outside the AD of the model,
because its leverage value is beyond the vertical leverage threshold line; thus, it is both a response outlier and a
high leverage chemical.
Two other chemicals (squares at 0.35 h) slightly exceed the critical leverage value but are close to three
chemicals of the training set (rhombi), slightly influential in the model development. The predictions for these
test chemicals can be considered as reliable as those of the training chemicals. Chemical 283 is wrongly
predicted ($r_i > 3\sigma$), but in this case it belongs to the model AD, being within the cutoff leverage value.
Therefore, although the predicted response for chemical 347 should not be accepted because it is not reliable,
the prediction for chemical 283 should be.
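The quantities plotted in a Williams plot can be computed as in the sketch below. It is an illustration only, assuming NumPy and an ordinary least squares model with an intercept; the leverage is taken as the diagonal element of the hat matrix built from the training descriptors, and the threshold is set to k times the average leverage p'/n, with k = 3 chosen here as one of the usual values mentioned in the text. Function and parameter names are hypothetical.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage of each query compound with respect to the training model matrix."""
    A = np.column_stack([np.ones(len(X_train)), X_train])   # intercept + descriptors
    Q = np.column_stack([np.ones(len(X_query)), X_query])
    core = np.linalg.inv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, core, Q)              # q_i' (A'A)^-1 q_i


def williams_flags(h, std_resid, n_train, n_params, k=3.0, resid_cut=3.0):
    """Flag high leverage compounds (h > h*) and response outliers (|r| > 3)."""
    h_star = k * n_params / n_train       # k times the average leverage p'/n
    return h > h_star, np.abs(std_resid) > resid_cut, h_star


# Toy example: one descriptor, five training compounds, one distant query compound
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
h_train = leverages(X_train, X_train)
h_new = leverages(X_train, np.array([[10.0]]))
high, outlier, h_star = williams_flags(h_new, np.array([0.5]), n_train=5, n_params=2)
print(h_train.sum(), h_new, h_star, high, outlier)
```

In this toy case the query compound is far from the training descriptor values, so it is flagged as outside the AD even though its standardized residual is small.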
Another approach to AD evaluation is based on the similarity/diversity, evaluated in the model descriptor space,
of the considered compound with respect to those belonging to the training set; in fact, a QSAR/QSPR prediction
should be reliable if the compound is in some way similar to one or more compounds present in the training
set [184]. High similarity is simply another way to use the interpolation ability of the model in place of extrapolation.
A stepwise procedure consisting of four stages was also proposed [177]. General parametric requirements
are imposed in the first stage, specifying in the domain only those chemicals that fall in the range of variation of
the physicochemical properties of the chemicals in the training set. Such properties (e.g., molecular weight,
absorption, water solubility, and volatility) are not usually the driving forces for the studied phenomenon, but they
may implicitly affect the measured endpoint, for example, by reducing the bioavailability of chemicals. The
second stage defines similarity measures that can be used to quantify the structural similarity between pairs of
molecules. Atom-centered fragments are the molecular descriptors used to determine such a similarity. The third
stage in defining the domain is based on a mechanistic understanding of the modeled phenomenon. This goal is
very difficult to reach because structure and mathematical formalism of the model, computational method used
for its derivation, accepted hypotheses, and so forth should be taken into account. The suggested approach is an
attempt to reduce the diversity in this matter, where the analysis is focused on functional groups whose reactivity
modulates the studied endpoint and structural fragments used in group contribution models. Finally, the
reliability of simulated metabolism (metabolites, pathways, and maps) is taken into account in assessing the
reliability of predictions, if metabolic activation of chemicals is a part of the QSAR model.
In any case, regardless of the specific method chosen for AD evaluation, this is always a very important task
in order to avoid unreliable predictions and a misuse of the results.
4.05.7.3 Validation
For several years, model validation has been a crucial part of the development of QSAR models, because
interest has focused on the use of effective predictive models [163,164,174,182,185–191].
In chemometrics, the term validation assumes its specific meaning in the framework of finding and producing
models and consists in the evaluation of the predictive ability of the model and in detecting model pathologies
such as chance correlation, redundancy, and useless model complexity.
In general terms, model validation can be carried out by using a subset of the available data as the training set
to build the model and the remaining part of the data as the test set to evaluate the predictive ability of the
model, comparing the test set experimental responses with the predicted ones (Figure 14).
[Figure 13: Williams plot; x-axis: hat (leverage) values, y-axis: standardized residuals; legend: training set, prediction set; labeled chemicals: 347, 283.]
Figure 13 Williams plot for an externally validated model for polar narcotics (leverage cutoff value: 2.5 $h^*$). Reproduced
from Papa, E.; Villa, F.; Gramatica, P. Statistically Validated QSARs, Based on Theoretical Descriptors, for Modeling Aquatic
Toxicity of Organic Chemicals in Pimephales promelas (Fathead Minnow). J. Chem. Inf. Model. 2005, 45, 1256–1266.

Several statistical parameters are used to estimate the model quality. Among these, the most popular
parameters are the coefficient of determination $R^2$, measuring the model fitting ability (Equation (42)), and
the corresponding coefficient $Q^2$, measuring the model predictive ability (Equation (43)):
\[ R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad 0 \le R^2 \le 1 \tag{42} \]
\[ R^2_{cv} \equiv Q^2 = 1 - \frac{PRESS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_{i/i})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad Q^2 \le 1 \tag{43} \]
where RSS is the residual sum of squares; PRESS is the predictive error sum of squares; $y_i$ and $\hat{y}_i$ are the
experimental and estimated responses, respectively; the notation $i/i$ indicates that the response of the ith
sample is evaluated when the ith sample does not participate in the model building, that is, it is not included in
the training set. The average response $\bar{y}$ is calculated from the training set samples and TSS stands for the total
sum of squares.
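The definitions in Equations (42) and (43) translate directly into code. The sketch below is an illustration assuming NumPy and an ordinary least squares model (the chapter does not tie these parameters to a specific regression method); the function names are hypothetical, and the cross-validated predictions are obtained by explicitly refitting the model with each object left out.

```python
import numpy as np


def fit_ols(X, y):
    """Least squares coefficients (intercept first)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


def r2_and_q2_loo(X, y):
    """R2 (Eq. 42) and leave-one-out Q2 (Eq. 43) for an OLS model."""
    tss = np.sum((y - y.mean()) ** 2)
    rss = np.sum((y - predict_ols(fit_ols(X, y), X)) ** 2)
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i            # leave the ith object out
        coef = fit_ols(X[keep], y[keep])
        press += (y[i] - predict_ols(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - rss / tss, 1.0 - press / tss


X = np.arange(10, dtype=float).reshape(-1, 1)
noise = np.array([0.5, -0.3, 0.2, -0.6, 0.1, 0.4, -0.2, 0.3, -0.5, 0.1])
y = 2.0 * X[:, 0] + 1.0 + noise
r2, q2 = r2_and_q2_loo(X, y)
print(r2, q2)
```

For least squares models PRESS is never smaller than RSS, so the computed Q2 cannot exceed R2.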
The $R^2$ is the most widely used measure of the ability of a QSAR model to reproduce the data in the training set
(goodness of fit), but it says nothing about the model predictivity. It is important to remember that, in contrast to the fitting
parameter $R^2$, which increases as more and more descriptors are added (until there is dangerous overfitting), the
value of $Q^2$ generally increases only when the added predictors are useful in predicting left-out compounds. Both
$R^2$ and $Q^2$ are often reported as percentages, that is, in terms of the percentage of explained variance.
Several validation techniques were proposed to estimate model predictivity, thus leading to different $Q^2$ values.
Validation techniques differ among themselves in how the objects are partitioned into training and test sets.
The most natural approach to model validation is the so-called training/test splitting, that is, a
technique based on the extraction from the original data set of a percentage of objects (usually from 10 to
50%) which therefore do not participate in the model building but are only used to test the model prediction
ability. The training/test splitting should be performed randomly thousands of times, each time leaving a fixed
percentage of objects in the test set. A single random training/test split is not recommended
because the results depend too strongly on that particular split [192].
Without doubt, LOO and LMO cross-validation techniques are the most popular techniques for internal
validation [191,193–195]. By these techniques, each object or each group of objects is put only once in the test set. By
the LOO technique, each object is in turn left out, the model is built by using the $n - 1$ remaining objects, and the
prediction is calculated for the object left out.
By the LMO technique, the data set is partitioned into $k$ cross-validation groups (usually from 2 to 10), each
containing a number $n_v$ of objects (approximately $n/k$); at each run the model is built using the $n - n_v$ objects of
the training set. The responses of each cross-validation group are predicted using the partial model built by the
objects belonging to the remaining groups.
The predictive power ($Q^2_{LOO}$ and $Q^2_{LMO}$, from LOO and LMO procedures, respectively) is calculated by
summing the squared differences obtained for the $n$ predictions.
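The LMO procedure described above can be sketched as follows. This is a minimal illustration assuming NumPy and an ordinary least squares model; the function name `q2_lmo`, the random assignment of objects to groups, and the `seed` parameter are assumptions made for the example. Setting `k` equal to the number of objects reproduces the LOO case.

```python
import numpy as np


def q2_lmo(X, y, k=5, seed=0):
    """Leave-many-out Q2 with k cross-validation groups for an OLS model."""
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(len(y)), k)   # about n/k objects each
    press = 0.0
    for test_idx in groups:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False                            # n - n_v training objects
        A = np.column_stack([np.ones(train.sum()), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.column_stack([np.ones(len(test_idx)), X[test_idx]]) @ coef
        press += np.sum((y[test_idx] - pred) ** 2)         # each object predicted once
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / tss


X = np.arange(20, dtype=float).reshape(-1, 1)
y_exact = 3.0 * X[:, 0] - 2.0
y_noisy = y_exact + np.random.default_rng(1).normal(size=20)
print(q2_lmo(X, y_exact, k=5), q2_lmo(X, y_noisy, k=4))
```

Because the groups are formed randomly here, different seeds give slightly different values for noisy data; repeating the partition several times corresponds to the Monte Carlo variant of LMO.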
As it has been demonstrated [196] that $Q^2_{LOO}$ tends to $R^2$ for an infinite number of samples, it is obvious that
$Q^2_{LOO}$ can very often be too optimistic, and thus, it results in an unreliable estimate of the prediction power.
An interesting variant of the LMO technique is the Monte Carlo LMO cross-validation (MCCV), by which
the partition of the objects into cross-validation groups is carried out randomly and repeated several times.
Bootstrap validation [185,186,197] is another commonly used technique by which the training set is built
repeatedly (thousands of times) by including the same initial number ($n$) of objects, but selected from the raw data
with replacement, that is, repeated objects are allowed in the training set. Because repeated objects are
allowed, each test set is composed of the objects that are not selected for the training set. This procedure allows
models to be built from training sets always having the same size and the model predictive performance to be checked
on test sets having different objects and sizes.
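A bootstrap validation loop can be sketched as below. It is an illustration under stated assumptions rather than a reference implementation: it assumes NumPy and an ordinary least squares model, and it pools the squared prediction errors over all iterations, using the mean response of each bootstrap training set as the reference in the denominator. The way the single iterations are combined into one value is an assumption of this example, as are the function and parameter names.

```python
import numpy as np


def q2_boot(X, y, n_iter=1000, seed=0):
    """Bootstrap Q2 for an OLS model, pooled over n_iter resamplings."""
    rng = np.random.default_rng(seed)
    n = len(y)
    press_sum, tss_sum = 0.0, 0.0
    for _ in range(n_iter):
        train_idx = rng.integers(0, n, size=n)                 # n objects, with replacement
        test_idx = np.setdiff1d(np.arange(n), train_idx)       # objects never selected
        if test_idx.size == 0:
            continue
        A = np.column_stack([np.ones(n), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
        pred = np.column_stack([np.ones(test_idx.size), X[test_idx]]) @ coef
        press_sum += np.sum((y[test_idx] - pred) ** 2)
        tss_sum += np.sum((y[test_idx] - y[train_idx].mean()) ** 2)
    return 1.0 - press_sum / tss_sum


X = np.arange(20, dtype=float).reshape(-1, 1)
y_exact = 3.0 * X[:, 0] - 2.0
print(q2_boot(X, y_exact, n_iter=50))
```

Each training set has the same size as the original data set, while the test sets differ in both composition and size from one iteration to the next, as described in the text.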
[Figure 14: diagram with elements labeled Data set, Training set, Test set, External set, Partial models, Final model, Fitting, Internal prediction, External prediction.]
Figure 14 General scheme of the validation of QSAR/QSPR models.
The bagging validation method, introduced by Breiman [198], is a modification of the bootstrap method, both with and
without replacement, which produces samples of smaller size than the original one (often $n/2$). Bagging involves
replacing an estimator by the average (or mode) of the values it takes when computed from $B$ resamples.
Boosted leave-many-out (boosted LMO) [199] is a validation method used to systematically vary the balance
between representativeness and diversity of training and test sets, proposed in the framework of CoMFA. It is a
nonparametric method employed to create external data sets including balanced sampling across the response
range and/or structural classes and maximizing training set diversity by a predefined criterion. The former
emphasizes making both test and training sets as representative as possible. The latter favors assignment of the
most unusual compounds to the training set to increase the statistical power of the models obtained.
Moreover, in the QSAR field, it is nowadays advised to carry out an external validation of previously
internally validated models at the model development step [163,164,182].
Together with training and test sets, an additional data set, called the external data set, is advised to obtain an
unbiased estimate of the prediction ability of a model. The external data set is usually built as a further test set
using some deterministic algorithm to split the raw data. A single random splitting is not suggested because the
validation results would depend too strongly on the performed unique random splitting. Therefore, clustering
methods are commonly used, such as k-means, Jarvis-Patrick, and hierarchical clustering methods, together with
more recent approaches based on Kohonen maps (or SOMs) [200] and sphere-exclusion algorithms [190]. Moreover,
D-optimal and distance-based optimal designs can also be efficiently used [169,201,202].
These techniques allow a partition of the objects by exploiting different similarity/diversity analyses,
spanning the whole chemical space and trying to perform the partition by a uniform covering. Then, the
objects are selected by these techniques in such a way that the training set objects are evenly distributed within
the whole chemical space and external set objects satisfy some condition of closeness to the training set objects.
A comparison among SOMs, Kennard-Stone design, D-optimal design, and random splitting was performed
by Wu et al. [203]. The best models were built when Kennard-Stone and D-optimal designs were used; SOMs
performed better than random selection, and D-optimal design was slightly better than random selection.
While the test set is used during the optimization procedure, that is, searching for the best subset of model
variables or the best architecture of neural networks, the external set is only used to evaluate the predictive
performance of the final selected model(s).
The external $Q^2_{EXT}$ is among the most used parameters for evaluating the prediction ability for the external
set and is defined as
\[ Q^2_{EXT} = 1 - \frac{\sum_{i=1}^{n_{TEST}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n_{TEST}} (y_i - \bar{y}_{TR})^2}, \qquad Q^2_{EXT} \le 1 \tag{44} \]
where the sums run over the external samples and the average response $\bar{y}_{TR}$ is that obtained from the training
set samples [167].
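Equation (44) can be evaluated with a few lines of code once the external predictions are available. The sketch below assumes NumPy; the function name `q2_ext` and the toy numbers are hypothetical and serve only to show how the training set mean enters the denominator.

```python
import numpy as np


def q2_ext(y_ext, y_ext_pred, y_train_mean):
    """External Q2 (Eq. 44): sums run over the external samples only."""
    y_ext = np.asarray(y_ext, dtype=float)
    y_ext_pred = np.asarray(y_ext_pred, dtype=float)
    press = np.sum((y_ext - y_ext_pred) ** 2)
    ss_ref = np.sum((y_ext - y_train_mean) ** 2)   # deviations from the TRAINING mean
    return 1.0 - press / ss_ref


y_ext = [1.0, 2.0, 3.0]
print(q2_ext(y_ext, [1.0, 2.0, 3.0], 2.0))   # perfect predictions
print(q2_ext(y_ext, [2.0, 2.0, 2.0], 2.0))   # no better than the training mean
print(q2_ext(y_ext, [3.0, 2.0, 1.0], 2.0))   # worse than the training mean
```

A value close to zero, or a negative one, means that the model predicts the external chemicals no better than the training set mean would, which is the situation labeled as unpredictive in the discussion of external validation.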
The limiting condition for external validation is the total number of samples, since it is difficult to build
a meaningful training set, a test set, and an external data set at the same time when few samples are available. In effect,
if very few chemicals are available, a model cannot be verified for its predictivity by checking only a few chemicals,
as in such cases the results could be obtained by chance, and it is impossible to derive general conclusions.
The role of external validation is more or less important depending on whether the model variables are
(1) already univocally defined or (2) selected among several candidate variables by using some selection
procedure. In effect, for case (1), if the external data set is chosen in such a way that its samples are similar to
those of training and test sets, the selected external samples depend on the whole data set, that is, they are
selected because they are in some way represented by other samples. Consequently, even if the samples of the
external data set do not participate in the model building, they are not completely independent of the training
samples, and thus, a good prediction ability can be reasonably expected.
In case (2), selection of the external data set is based on the information given by all the candidate variables,
looking at the distribution of the compounds in the whole chemical space. External compounds are usually
selected to be similar to training compounds in the original chemical space, because by definition they must not
participate in the optimization procedure. However, the chemical space used to select external compounds is
obviously different from the chemical space associated with each model, which is defined by a small number of selected
molecular descriptors, and, accordingly, similarity relationships among compounds may change significantly. In
other words, external compounds may differ from the training compounds within a specific model space, which
makes them a useful tool to assess the general applicability of the model. In conclusion, external validation allows
the detection of models lacking sufficient generalizability.
On the other hand, external compounds that are uniformly distributed among the training compounds in the
whole chemical space could turn out to be outliers in the chemical space of the specific models obtained, and a low
prediction ability simply indicates that the external samples are not represented in the model chemical space.
External validation is not proposed as an alternative to internal validation but as an additional validation step
to be taken when models are obtained by variable selection procedures. In effect, the best models should be
selected by optimizing internal validation parameters, usually the LOO $Q^2$. Then, only the good models, stable
and internally predictive, are subjected to external validation.
It is not unusual that models with high internal predictivity, verified by different internal validation methods,
but externally less predictive or even not predictive at all, are present in the population of models
developed, for example, by a GA technique.
An example of this situation is highlighted in Table 6, which lists the first 30 models of a GA population of
PAH mutagenicity models (TA100 on 48 PAHs) [204].
Table 6 GA population of models (a) for 48 nitro-PAH mutagenicity (31 in training and 17 in prediction set): fitting ($R^2$), internal
validation ($Q^2_{LOO}$ and $Q^2_{BOOT}$), and external validation ($Q^2_{EXT}$) parameters
ID Model descriptors $R^2$ $Q^2_{LOO}$ $Q^2_{BOOT}$ $Q^2_{EXT}$
1 PW2 SIC1 85.70 82.44 82.36 72.27
2 PW2 CIC1 84.88 80.78 80.71 75.34
3 X1A MATS1e 82.42 79.32 79.00 85.75
4 Mv MATS2e 83.37 79.04 79.25 84.27
5 Mv MATS1e 81.76 78.47 78.42 74.86
6 Mv GATS2m 81.57 77.87 78.10 69.13
7 GATS1e VED2 81.07 77.64 77.68 88.06
8 Xt nPyr 80.25 77.48 77.41 81.71
9 Mv PW2 80.95 77.39 77.97 71.85
10 PW2 IC1 80.89 77.04 77.32 60.07
11 JGI3 VED2 80.27 76.76 76.91 66.67
12 Mp LUMO 80.78 76.54 76.55 70.13
13 Mv LUMO 80.26 76.15 76.11 63.74
14 BELe8 HATS4u 80.53 76.10 76.17 47.59
15 IC1 VED2 80.17 76.09 76.55 80.94
16 Xt MATS1e 80.23 76.08 75.96 86.79
17 PW2 HIC 80.14 75.99 76.16 69.62
18 SIC1 VED2 79.92 75.78 76.11 81.65
19 VED2 Hy 79.55 75.52 75.63 86.98
20 VED2 R6u 79.27 75.52 75.50 27.18
21 HATS3u R3v 79.55 75.52 75.23 0
22 Mv MATS2m 79.25 75.37 75.64 69.21
23 Xt BELm2 79.89 75.35 75.40 69.54
24 GGI3 VED2 79.10 75.34 75.58 63.50
25 BELe8 R4u 80.06 75.32 75.30 50.23
26 SIC2 BEHm8 79.14 75.13 75.48 61.48
27 VED2 RTe 78.65 75.13 75.32 69.76
28 CIC2 VED2 79.49 75.06 75.08 77.75
29 SIC2 BELv5 79.40 75.02 75.36 58.31
30 X1A LUMO 79.13 74.96 74.91 78.98
(a) In bold the models with reduced predictive performance in external validation in comparison to internal validation.
Reproduced from Gramatica, P.; Pilutti, P.; Papa, E. Approaches for Externally Validated QSAR Modelling of Nitrated
Polycyclic Aromatic Hydrocarbon Mutagenicity. SAR QSAR Environ. Res. 2007, 18, 169–178.
Some models (in bold) appear stable and predictive by internal validation parameters ($Q^2$ and $Q^2_{BOOT}$), but
are less predictive (or even unpredictive: $Q^2_{EXT} \approx 0$) when applied to external chemicals that were never
presented to the GA during model development. It is also important to note that the less predictive models (in
bold) are based on different kinds of molecular descriptors; thus, model instability cannot be attributed to a
particular descriptor. The best combination of modeling variables must be chosen from the models in this GA
population that guarantee, first of all, stability and internal predictivity and, additionally, external
predictive ability.
Moreover, if the model is found by using some variable selection technique from a huge number of
potential candidate variables, high correlations with the modeled response can occur in the models purely by
chance [205–207]. Therefore, it is very important to evaluate whether the models provided by some variable
selection tool are good only by chance.
A statistical tool able to detect the presence of chance correlation in a model and/or a lack of model
robustness is the permutation test (also known as Y-scrambling or Y-randomization) [208] and its recent
developments such as progressive scrambling [209] and other variants [210]. This validation technique consists of repeating
the calculation procedure with randomized responses and subsequently assessing the probability of the resulting
statistics. Frequently, it is used along with cross-validation. Models obtained for the same data
but with randomized responses are expected to have low values of the quality parameter (e.g., $Q^2$ or $R^2$). If, instead,
the models based on the randomized responses sometimes reach high $Q^2$ ($R^2$) values, the model is rejected because of a
suspected chance correlation.
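A basic permutation test can be sketched as follows. The example is an illustration under stated assumptions: it uses NumPy, an ordinary least squares model, and $R^2$ as the quality parameter (a cross-validated $Q^2$ could be used in the same way); the function names and the empirical p-value formula are choices made for this sketch and are not taken from the cited references.

```python
import numpy as np


def r2_ols(X, y):
    """Coefficient of determination of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss


def y_scrambling(X, y, n_perm=200, seed=0):
    """Compare the true R2 with the R2 values obtained after permuting y."""
    rng = np.random.default_rng(seed)
    r2_true = r2_ols(X, y)
    r2_perm = np.array([r2_ols(X, rng.permutation(y)) for _ in range(n_perm)])
    # Empirical probability of reaching the true R2 by chance
    p_value = (1 + np.sum(r2_perm >= r2_true)) / (n_perm + 1)
    return r2_true, r2_perm, p_value


x = np.arange(30, dtype=float)
X = x.reshape(-1, 1)
y = 2.0 * x + np.sin(x)                # strong genuine relationship
r2_true, r2_perm, p_value = y_scrambling(X, y)
print(r2_true, r2_perm.mean(), p_value)
```

When a variable selection step is part of the modeling procedure, that step should be repeated inside each permutation, because it is the selection from many candidate variables that makes chance correlation likely; the sketch omits it for brevity.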
4.05.7.4 Model Descriptor Interpretability
Regarding the interpretability of the descriptors, it is important to take into account that the modeled response is
frequently the result of a series of complex biological or physicochemical mechanisms; thus, it is very difficult
and reductionist to ascribe too much importance to the mechanistic meaning of the molecular descriptors used
in a QSAR model. Moreover, it must also be highlighted that in multivariate models such as MLR models, even
though the interpretation of a single molecular descriptor can certainly be useful, it is only the combination
of the selected set of descriptors that is able to model the studied endpoint. If the main aim of QSAR modeling
is to fill the gaps in available data, the modeler's attention should be focused on model quality. In relation to this
point, Livingstone states [211]: 'The need for interpretability depends on the application, since a validated
mathematical model relating a target property to chemical features may, in some cases, be all that is necessary,
though it is obviously desirable to attempt some explanation of the mechanism in chemical terms, but it is
often not necessary, per se.' Zefirov and Palyulin [175] took the same position, differentiating predictive QSARs,
where attention essentially concerns the best prediction quality, from descriptive QSARs, where major attention
is paid to descriptor interpretability.
4.05.7.5 Summaries of QSAR Models
Several parameters are available for describing the QSAR model quality. This topic exceeds the scope of this
chapter; therefore, only a simple example of the necessary parameters for regression QSAR models is given here.
Typical regression QSAR models are usually reported (or should be reported) as in the following example,
where both $R^2$ and $Q^2$ are reported as percentages:
\[
\begin{aligned}
\log P &= 3.1\,(\pm 0.15) + 0.0056\,(\pm 0.0002)\,X_1 + 12\,(\pm 1)\,X_2 \\
n &= 15 \qquad Q^2_{LOO} = 93.6 \qquad RMSEP = 0.792 \\
R^2 &= 97.7 \qquad RMSEC = 0.821
\end{aligned}
\tag{45}
\]
where $X_1$ and $X_2$ are two molecular descriptors, $\log P$ is the studied property, $n$ is the number of training
samples, and the subscript LOO indicates that validation was performed by the LOO technique. The numbers
in the equation are the regression coefficients with their uncertainties.
Other useful parameters to be considered are the RMSEs (root mean square errors) calculated on the training set
(also called SDEC or SEC) and the test set (also called SDEP or SEP), representing the average errors in fitting and
in prediction, respectively, and defined as
\[ SDEC \equiv RMSEC = \sqrt{\frac{RSS}{n}} \tag{46} \]
\[ SDEP \equiv RMSEP = \sqrt{\frac{PRESS}{n}} \tag{47} \]
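Equations (46) and (47) share the same form and differ only in which predictions enter the sum of squares. A minimal sketch, assuming NumPy and a hypothetical helper named `rmse`, with made-up numbers used purely for illustration:

```python
import numpy as np


def rmse(y, y_hat):
    """Root mean square error: sqrt(sum of squared errors / n)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


y_obs = [1.0, 2.0, 3.0]
y_fitted = [1.0, 2.0, 5.0]    # values calculated by the model fitted on all objects
y_cv = [0.0, 2.0, 5.0]        # values predicted when each object is left out

rmsec = rmse(y_obs, y_fitted)  # Eq. 46: uses RSS
rmsep = rmse(y_obs, y_cv)      # Eq. 47: uses PRESS
print(rmsec, rmsep)
```

Both quantities are expressed in the units of the modeled response, which makes them easier to compare with the experimental uncertainty than the dimensionless $R^2$ and $Q^2$.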
Obviously, other estimates of the prediction power performed on an external test set (reporting also the number
of objects in the external set) or using LMO (together with the percentage of objects left out in each step) or
bootstrap techniques should be reported, together with other specific information about the adopted validation
techniques:
\[
\begin{aligned}
n = 15 &\qquad Q^2_{LMO}(20\%) = 91.1 \qquad Q^2_{LMO}(30\%) = 90.3 \qquad Q^2_{BOOT} = 90.6 \\
n = 5 &\qquad Q^2_{EXT} = 88.0 \qquad RMSEP_{EXT} = 0.872
\end{aligned}
\tag{48}
\]
4.05.8 Conclusions
The scientific community is showing more and more interest in the QSAR field. Several chemometric methods
were specifically conceived to solve QSAR problems, responding to the demand for an ever deeper
understanding of chemical systems and their relationships with biological systems.
Several questions are still open and matters of debate, such as the problem of the validation strategies to
obtain predictive models, the interpretability of complex molecular descriptors, and the introduction of new
modeling tools.
Nowadays, the need to deal with biological systems described by peptide/protein or DNA sequences, to
describe proteomics maps, or to give effective answers to ecological and health problems pushes further toward
new borders where mathematics, statistics, chemistry, and biology and their interrelationships may produce
new effective useful knowledge.
References
1. Martin, Y. C. Advances in the Methodology of Quantitative Drug Design. In Drug Design, Vol. VIII; Ariëns, E. J., Ed.; Academic
Press: New York, NY, 1979; pp 1–72.
2. Kubinyi, H., Ed. 3D QSAR in Drug Design. Theory, Methods, and Applications; ESCOM: Leiden, The Netherlands, 1993; 760 pp.
3. Hansch, C.; Leo, A. Exploring QSAR. Fundamentals and Applications in Chemistry and Biology; American Chemical Society:
Washington, DC, 1995.
4. van de Waterbeemd, H.; Testa, B.; Folkers, G. Eds. Computer-Assisted Lead Finding and Optimization. Wiley-VCH: Weinheim,
Germany, 1997; 554 pp.
5. Devillers, J., Ed. Comparative QSAR; Taylor & Francis: Washington, DC, 1998; 371 pp.
6. Kubinyi, H.; Folkers, G.; Martin, Y. C., Eds. 3D QSAR in Drug Design Vol. 3; Kluwer/ESCOM: Dordrecht, The Netherlands, 1998;
352 pp.
7. Kubinyi, H.; Folkers, G.; Martin, Y. C., Eds. 3D QSAR in Drug Design Vol. 2; Kluwer/ESCOM: Dordrecht, The Netherlands, 1998;
416 pp.
8. Martin, Y. C. 3D QSAR: Current State Scope, and Limitations. In 3D QSAR in Drug Design; Kubinyi, H., Folkers, G., Martin, Y. C.,
Eds.; Kluwer/ESCOM: Dordrecht, The Netherlands, 1998; Vol. 3, pp 323.
9. Charton, M.; Charton, B. I. Advances in Quantitative StructureProperty Relationships; JAI Press: Amsterdam, The Netherlands,
2002; 228 pp.
10. Gasteiger, J. Handbook of Chemoinformatics. From Data to Knowledge in 4 Volumes; Wiley-VCH: Weinheim, Germany, 2003;
1870 pp.
11. Oprea, T. I. 3DQSAR Modeling in Drug Design. In Computational Medicinal Chemistry for Drug Discovery; Bultinck, P., De Winter,
H., Langenaeker, W., Tollenaere, J. P., Eds.; Marcel Dekker: New York, NY, 2004; pp 571616.
12. Crum-Brown, A. On the Theory of Isomeric Compounds. Trans. R. Soc. Edinb. 1864, 23, 707719.
13. Crum-Brown, A. On an Application of Mathematics to Chemistry. Proc. R. Soc. (Edinb.) 1866, VI (73), 8990.
164 Chemometrics in QSAR
14. Crum-Brown, A.; Fraser, T. R. On the Connection between Chemical Constitution and Physiological Action. Part 1. On the
Physiological Action of Salts of the Ammonium Bases, Derived from Strychnia, Brucia, Thebia, Codeia, Morphia and Nicotia.
Trans. R. Soc. Edinb. 1868, 25, 151203.
15. Ko rner, W. Studi sulla Isomeria delle Cos` Dette Sostanze Aromatiche a Sei Atomi di Carbonio. Gazz. Chim. Ital. 1874, 4, 242.
16. Mills, E. J. On Melting Point and Boiling Point as Related to Composition. Philos. Mag. 1884, 17, 173187.
17. Richet, M. C. Note sur la Rapport entre la Toxicite et les Proprie te s Physiques des Corps. Compt. Rend. Soc. Biol. (Paris) 1893,
45, 775776.
18. Meyer, H. Zur Theorie der Alkoholnarkose. Arch. Exp. Pathol. Pharmacol. 1899, 42, 109118.
19. Overton, E. Studien u ber die Narkose, zugleich ein Beitrag zur allgemeinen Pharmakologie; Verlag Gustav Fischer: Jena, Germany
1901; 141 pp.
20. Traube, I. Theorie der Osmose und Narkose. Arch. fu r die ges. Physiol. 1904, 105, 541558.
21. Wiener, H. Influence of Interatomic Forces on Paraffin Properties. J. Chem. Phys. 1947, 15, 766.
22. Platt, J. R. Influence of Neighbor Bonds on Additive Bond Properties in Paraffins. J. Chem. Phys. 1947, 15, 419420.
23. Fujita, T.; Iwasa, J.; Hansch, C. A New Substituent Constant, , Derived from Partition Coefficients. J. Am. Chem. Soc. 1964, 86,
51755180.
24. Gordon, M.; Scantlebury, G. R. Non-RandomPolycondensation: Statistical Theory of the Substitution Effect. Trans. Faraday Soc.
1964, 60, 604621.
25. Smolenskii, E. A. Application of the Theory of Graphs to Calculations of the Additive Structural Properties of Hydrocarbons. Russ.
J. Phys. Chem. 1964, 38, 700702.
26. Spialter, L. The Atom Connectivity Matrix (ACM) and Its Characteristic Polynomial (ACMCP). J. Chem. Doc. 1964, 4, 261269.
27. Balaban, A. T.; Harary, F. Chemical Graphs. V. Enumeration and Proposed Nomenclature of Benzenoid Catacondensed
Polycyclic Aromatic Hydrocarbons. Tetrahedron 1968, 24, 25052516.
28. Harary, F. Graph Theory; Addison-Wesley: Reading, MA, 1969.
29. Kier, L. B. Molecular Orbital Theory in Drug Research; Academic Press: New York, NY, 1971.
30. Cammarata, A. Interrelationship of the Regression Models Used for StructureActivity Analyses. J. Med. Chem. 1972, 15, 573577.
31. Gutman, I.; Trinajstic, N. Graph Theory and Molecular Orbitals. Total -Electron Energy of Alternant Hydrocarbons. Chem. Phys.
Lett. 1972, 17, 535538.
32. Hosoya, H. Topological Index as a Sorting Device for Coding Chemical Structures. J. Chem. Doc. 1972, 12, 181183.
33. Pauling, L. The Additivity of the Energies of Normal Covalent Bonds. Proc. Natl. Acad. Sci. USA 1932, 14, 414416.
34. Pauling, L. The Nature of the Chemical Bond; Cornell University Press: Ithaca, NY, 1939.
35. Coulson, C. A. The Electronic Structure of Some Polyenes and Aromatic Molecules. VII. Bonds of Fractional Order by the
Molecular Orbital Method. Proc. R. Soc. London A 1939, 169, 413428.
36. Sanderson, R. T. Electronegativity I. Orbital Electronegativity of Neutral Atoms. J. Chem. Educ. 1952, 29, 540546.
37. Fukui, K.; Yonezawa, Y.; Shingu, H. Theory of Substitution in Conjugated Molecules. Bull. Chem. Soc. Jpn. 1954, 27, 423427.
38. Mulliken, R. S. Electronic Population Analysis on LCAO-MO Molecular Wave Functions. I. J. Chem. Phys. 1955, 23, 18331840.
39. Hammett, L. P. Reaction Rates and Indicator Acidities. Chem. Rev. 1935, 17, 6779.
40. Hammett, L. P. The Effect of Structure upon the Reactions of Organic Compounds. Benzene Derivatives. J. Am. Chem. Soc.
1937, 59, 96103.
41. Hammett, L. P. Linear Free Energy Relationships in Rate and Equilibrium Phenomena. Trans. Faraday Soc. 1938, 34, 156165.
42. Taft, R. W. Polar and Steric Substituent Constants for Aliphatic and o-Benzoate Groups from Rates of Esterification and
Hydrolysis of Esters. J. Am. Chem. Soc. 1952, 74, 31203128.
43. Taft, R. W. The General Nature of the Proportionality of Polar Effects of Substituent Groups in Organic Chemistry. J. Am. Chem.
Soc. 1953, 75, 42314238.
44. Taft, R. W. Linear Steric Energy Relationships. J. Am. Chem. Soc. 1953, 75, 45384539.
45. Hansch, C.; Maloney, P. P.; Fujita, T.; Muir, R. M. Correlation of Biological Activity of Phenoxyacetic Acids with Hammett
Substituent Constants and Partition Coefficients. Nature 1962, 194, 178180.
46. Hansch, C.; Muir, R. M.; Fujita, T.; Maloney, P. P.; Geiger, F.; Streich, M. The Correlation of Biological Activity of Plant Growth
Regulators and Chloromycetin Derivatives with Hammett Constants and Partition Coefficients. J. Am. Chem. Soc. 1963, 85,
28172824.
47. Free, S. M.; Wilson, J. W. A Mathematical Contribution to StructureActivity Studies. J. Med. Chem. 1964, 7, 395399.
48. Kubinyi, H. Free Wilson Analysis. Theory, Applications and Its Relationship to Hansch Analysis. Quant. Struct. -Act. Relat. 1988,
7, 121133.
49. Balaban, A. T.; Harary, F. The Characteristic Polynomial Does Not Uniquely Determine the Topology of a Molecule. J. Chem. Doc.
1971, 11, 258259.
50. Balaban, A. T. Ed. Chemical Applications of Graph Theory; Academic Press: New York, NY, 1976; 390 pp.
51. Randic, M. On the Recognition of Identical Graphs Representing Molecular Topology. J. Chem. Phys. 1974, 60, 39203928.
52. Randic, M. On Characterization of Molecular Branching. J. Am. Chem. Soc. 1975, 97, 66096615.
53. Kier, L. B.; Hall, L. H.; Murray, W. J.; Randic, M. Molecular Connectivity. I: Relationship to Nonspecific Local Anesthesia. J.
Pharm. Sci. 1975, 64, 19711974.
54. Rohrbaugh, R. H.; Jurs, P. C. Descriptions of Molecular Shape Applied in Studies of Structure/Activity and Structure/Property
Relationships. Anal. Chim. Acta 1987, 199, 99109.
55. Stanton, D. T.; Jurs, P. C. Development and Use of Charged Partial Surface Area Structural Descriptors in Computer-Assisted
Quantitative StructureProperty Relationship Studies. Anal. Chem. 1990, 62, 23232329.
56. Todeschini, R.; Lasagni, M.; Marengo, E. New Molecular Descriptors for 2D- and 3D-Structures, Theory. J. Chemom. 1994, 8,
263273.
57. Katritzky, A. R.; Mu, L.; Lobanov, V. S.; Karelson, M. Correlation of Boiling Points with Molecular Structure. 1. A Training Set of
298 Diverse Organics and a Test Set of 9 Simple Inorganics. J. Phys. Chem. 1996, 100, 1040010407.
58. Ferguson, A. M.; Heritage, T. W.; Jonathon, P.; Pack, S. E.; Phillips, L.; Rogan, J.; Snaith, P. J. EVA: A New Theoretically Based
Molecular Descriptor for Use in QSAR/QSPR Analysis. J. Comput. Aided Mol. Des. 1997, 11, 143152.
Chemometrics in QSAR 165
59. Schuur, J.; Selzer, P.; Gasteiger, J. The Coding of the Three-Dimensional Structure of Molecules by Molecular Transforms
and Its Application to Structure-Spectra Correlations and Studies of Biological Activity. J. Chem. Inf. Comput. Sci. 1996, 36,
334344.
60. Tuppurainen, K. EEVA (Electronic Eigenvalue): A New QSAR/QSPR Descriptor for Electronic Substituent Effects Based on
Molecular Orbital Energies. SAR QSAR Environ. Res. 1999, 10, 3946.
61. Consonni, V.; Todeschini, R.; Pavan, M. Structure/Response Correlations and Similarity/Diversity Analysis by GETAWAY
Descriptors. Part 1. Theory of the Novel 3D Molecular Descriptors. J. Chem. Inf. Comput. Sci. 2002, 42, 682692.
62. Goodford, P. J. A Computational Procedure for Determining Energetically Favorable Binding Sites on Biologically Important
Macromolecules. J. Med. Chem. 1985, 28, 849857.
63. Cramer, R. D. III; Patterson, D. E.; Bunce, J. D. Comparative Molecular Field Analysis (CoMFA). 1. Effect of Shape on Binding of
Steroids to Carrier Proteins. J. Am. Chem. Soc. 1988, 110, 59595967.
64. Klebe, G.; Abraham, U.; Mietzner, T. Molecular Similarity Indices in a Comparative Analysis (CoMSIA) of Drug Molecules to
Correlate and Predict Their Biological Activity. J. Med. Chem. 1994, 37, 41304146.
65. Jain, A. N.; Koile, K.; Chapman, D. Compass: Predicting Biological Activities from Molecular Surface Properties. Performance
Comparisons on a Steroid Benchmark. J. Med. Chem. 1994, 37, 23152327.
66. Todeschini, R.; Moro, G.; Boggia, R.; Bonati, L.; Cosentino, U.; Lasagni, M.; Pitea, D. Modeling and Prediction of Molecular
Properties. Theory of Grid-Weighted Holistic Invariant Molecular (G-WHIM) Descriptors. Chemom. Intell. Lab. Syst. 1997, 36,
6573.
67. Chuman, H.; Karasawa, M.; Fujita, T. A Novel 3-Dimensional QSAR Procedure Voronoi Field Analysis. Quant. Struct. -Act. Relat.
1998, 17, 313326.
68. Cruciani, G.; Pastor, M.; Guba, W. VolSurf: A New Tool for the Pharmaceutic Optimization of Lead Compounds. Eur. J. Pharm.
Sci. 2000, 11 (Suppl.), S29S39.
69. Pastor, M.; Cruciani, G.; McLay, I. M.; Pickett, S. D.; Clementi, S. GRid-INdependent Descriptors (GRIND): A Novel Class of
Alignment-Independent Three-Dimensional Molecular Descriptors. J. Med. Chem. 2000, 43, 32333243.
70. Kubinyi, H. QSAR in Drug Design. In Handbook of Chemoinformatics; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, Germany, 2003;
Vol. 4, pp 15321554.
71. Kohonen, T. Self-Organization and Associative Memory; Springer: Berlin, Germany, 1989.
72. Zupan, J.; Novic, M.; Gasteiger, J. Neural Networks with Counter-Propagation Learning Strategy Used for Modelling. Chemom.
Intell. Lab. Syst. 1995, 27, 175187.
73. Livingstone, D. J.; Salt, D. W. Regression Analysis for QSAR Using Neural Networks. Bioorg. Med. Chem. Lett. 1992, 2, 213218.
74. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 532.
75. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, C.; Sheridan, R. P.; Feuston, B. P. Random Forest: A Classification and Regression
Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 19471958.
76. Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, 1995.
77. Worth, A. P.; Cronin, M. T. D. Embedded Cluster Modelling A Novel Method for Analysing Embedded Data Sets. Quant. Struct. -
Act. Relat. 1999, 18, 229235.
78. Todeschini, R.; Ballabio, D.; Consonni, V.; Mauri, A.; Pavan, M. CAIMAN (Classification and Influence Matrix Analysis): A New
Classification Method Based on Leverage-Scaled Functions. Chemom. Intell. Lab. Syst. 2007, 87, 317.
79. Sabljic, A. Predictions of the Nature and Strength of Soil Sorption of Organic Pollutants by Molecular Topology. J. Agric. Food
Chem. 1984, 32, 243246.
80. Halfon, E.; Galassi, S.; Bru ggemann, R.; Provini, A. Selection of Priority Properties to Assess Environmental Hazard of Pesticides.
Chemosphere 1996, 33, 15431562.
81. Bru ggemann, R.; Pudenz, S.; Carlsen, L.; Srensen, P. B.; Thomsen, M.; Mishra, R. K. The Use of Hasse Diagrams as a Potential
Approach for Inverse QSAR. SAR QSAR Environ. Res. 2001, 11, 473487.
82. Pavan, M.; Mauri, A.; Todeschini, R. Total Ranking Models by the Genetic Algorithms Variable Subset Selection (GA-VSS)
Approach for Environmental Priority Settings. Anal. Bioanal. Chem. 2004, 380, 430444.
83. Pavan, M.; Consonni, V.; Todeschini, R. Partial Ranking Models by Genetic Algorithms Variable Subset Selection (GA-VSS)
Approach for Environmental Priority Settings. MATCH Commun. Math. Comput. Chem. 2005, 54, 583609.
84. Gordeeva, E. V.; Molchanova, M. S.; Zefirov, N. S. General Methodology and Computer Program for the Exhaustive Restoring of
Chemical Structures by Molecular Connectivity Indices. Solution of the Inverse Problem in QSAR/QSPR. Tetrahedron Comput.
Method. 1990, 3, 389415.
85. Zefirov, N.; Palyulin, V. A.; Skvortsova, M. I.; Baskin, I. I. Inverse Problems in QSAR. In QSAR and Molecular Modelling: Concepts,
Computational Tools and Biological Applications; Sanz, F., Giraldo, J.; Manaut, F., Eds.; Prous Science: Barcelona, Spain, 1995;
pp 4041.
86. Tarko, L.; Ivanciuc, O. QSAR Modeling of the Anticolvulsant Activity of Phylacetanilides with PRECLAV (Property Evaluation by
Class Variables). MATCH Commun. Math. Comput. Chem. 2001, 44, 201214.
87. Kamlet, M. J.; Abboud, J.-L. M.; Taft, R. W. An Examination of Linear Solvation Energy Relationships. Prog. Phys. Org. Chem.
1981, 13, 485630.
88. Kamlet, M. J.; Doherty, P. J.; Taft, R. W.; Abraham, M. H.; Veith, G. D.; Abraham, D. J. Solubility Properties in Polymers and
Biological Media. 8. An Analysis of the Factors that Influence Toxicities of Organic Nonelectrolytes to the Golden Orfe Fish
(Leuciscus idus melanotus). Environ. Sci. Technol. 1987, 21, 149155.
89. Kamlet, M. J.; Doherty, R. M.; Abboud, J.-L. M.; Abraham, M. H.; Taft, R. W. Solubility. A New Look. Chemtech 1986, 16, 566576.
90. Kamlet, M. J.; Abraham, M. H.; Doherty, R. M.; Taft, R. W. Solubility Properties in Polymers and Biological Media. 4. Correlations
of Octanol/Water Partition Coefficients with Solvatochromic Parameters. J. Am. Chem. Soc. 1984, 106, 464466.
91. Kamlet, M. J.; Doherty, R. M.; Carr, P. W.; Mackay, D.; Abraham, M. H.; Taft, R. W. Linear Solvation Energy Relationships. 44.
Parameter Estimation Rules that Allow Accurate Prediction of Octanol/Water Partition Coefficients and Other Solubility and
Toxicity Properties of Polychlorinated Biphenyls and Polycyclic Aromatic Hydrocarbons. Environ. Sci. Technol. 1988, 22,
503509.
166 Chemometrics in QSAR
92. Abraham, M. H.; Ibrahim, A.; Acree, W. E. Jr. Air to Blood Distribution of Volatile Organic Compounds: A Linear Free Energy
Analysis. Chem. Res. Toxicol. 2005, 18, 904911.
93. Reinhard, M.; Drefahl, A. Handbook for Estimating Physicochemical Properties of Organic Compounds; Wiley: New York, NY,
228 pp.
94. Nys, G. G.; Rekker, R. F. Statistical Analysis of a Series of Partition Coefficients with Special Reference to the Predictability of
Folding of Drug Molecules. The Introduction of Hydrophobic Fragmental Constants (f Values). Eur. J. Med. Chem. 1973, 8,
521535.
95. Broto, P.; Moreau, G.; Vandycke, C. Molecular Structures: Perception, Autocorrelation Descriptor and SAR Studies. System of
Atomic Contributions for the Calculation of the n-Octane/Water Partition Coefficients. Eur. J. Med. Chem. 1984, 19, 7178.
96. Ghose, A. K.; Crippen, G. M. Atomic Physicochemical Parameters for Three-Dimensional-Structure-Directed Quantitative
StructureActivity Relationships. I. Partition Coefficients as a Measure of Hydrophobicity. J. Comput. Chem. 1986, 7, 565577.
97. Moriguchi, I.; Hirono, S.; Liu, Q.; Nakagome, I.; Matsushita, Y. Simple Method of Calculating Octanol/Water Partition Coefficient.
Chem. Pharm. Bull. 1992, 40, 127130.
98. Klopman, G.; Li, J. Y.; Wang, S.; Dimayuga, M. Computer Automated log P Calculations Based on an Extended Group
Contribution Approach. J. Chem. Inf. Comput. Sci. 1994, 34, 752781.
99. Wang, S.; Milne, G. W. A.; Klopman, G. Graph Theory and Group Contributions in the Estimation of Boiling Points. J. Chem. Inf.
Comput. Sci. 1994, 34, 12421250.
100. Krzyzaniak, J. F.; Myrdal, P. B.; Simamora, P.; Yalkowsky, S. H. Boiling Point and Melting Point Prediction for Aliphatic, Non-
Hydrogen-Bonding Compounds. Ind. Eng. Chem. Res. 1995, 34, 25302535.
101. Ghose, A. K.; Crippen, G. M. Atomic Physicochemical Parameters for Three-Dimensional-Structure-Directed Quantitative
StructureActivity Relationships. 2. Modeling Dispersive and Hydrophobic Interactions. J. Chem. Inf. Comput. Sci. 1987, 27,
2135.
102. Perrin, D. D.; Dempsey, B.; Serjeant, E. P. pKa Prediction for Organic Acids and Bases; Chapman & Hall: London, UK, 1981.
103. Klopman, G.; Wang, S.; Balthasar, D. M. Estimation of Aqueous Solubility of Organic Molecules by the Group Contribution
Approach. Application to the Study of Biodegradation. J. Chem. Inf. Comput. Sci. 1992, 32, 474482.
104. Tao, S.; Piao, H.; Dawson, R.; Lu, X.; Hu, H. Estimation of Organic Carbon Normalized Sorption Coefficient (K
OC
) for Soils Using
the Fragment Constant Method. Environ. Sci. Technol. 1999, 33, 27192725.
105. Yoneda, Y. An Estimation of the Thermodynamic Properties of Organic Compounds in the Ideal Gas State. I. Acyclic
Compounds and Cyclic Compounds with a Ring of Cyclopentane, Cyclohexane, Benzene or Naphthalene. Bull. Chem. Soc.
Jpn. 1979, 52, 12971314.
106. Reid, R. C.; Prausnitz, J. M.; Poling, B. E. The Properties of Gases and Liquids; McGraw-Hill: New York, NY, 1988.
107. Atkinson, R. A StructureActivity Relationships for the Estimation of Rate Constants for the Gas-Phase Reactions of OH
Radicals with Organic Compounds. Int. J. Chem. Kinet. 1987, 19, 799828.
108. Ertl, P.; Rohde, B.; Selzer, P. Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-Based Contributions and
Its Application to the Prediction of Drug Transport Properties. J. Med. Chem. 2000, 43, 37143717.
109. McFarland, J. W.; Gans, D. J. Cluster Significance Analysis: A New QSAR Tool for Asymmeric Data Sets. Drug Inf. J. 1990, 24,
705711.
110. Rose, V. S.; Wood, J. Generalized Cluster Significance Analysis and Stepwise Cluster Significance Analysis with Conditional
Probabilities. Quant. Struct. -Act. Relat. 1998, 17, 348356.
111. Worth, A. P.; Bassan, A.; Fabjan, E.; Gallegos Saliner, A.; Netzeva, T. I.; Patlewicz, G.; Pavan, M.; Tsakovska, I. The Use of
Computational Methods in the Grouping and Assessment of Chemicals Preliminary Investigations. Eur. Tech. Rep. 2008, in
press.
112. Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; Wiley-VCH: Weinheim, Germany, 2000; 668 pp.
113. Randic, M. Molecular Bonding Profiles. J. Math. Chem. 1996, 19, 375392.
114. ADAPT. Jurs, P.C., Pensilvania State University (PN).
115. Mekenyan, O.; Karabunarliev, S.; Bonchev, D. The OASIS Concept for Predicting Biological Activity of Chemical Compounds.
J. Math. Chem. 1990, 4, 207215.
116. CODESSA Reference Manual 2.0. Katritzky, A.R.; Lobanov, V.S.; Karelson, M., Gainsville (FL).
117. MolConn-Z: A Program for Molecular Topology Analysis 3. Hall Associates Consulting, Quincy (MA).
118. DRAGON (Software for molecular descriptor calculations) 5.5. Talete s.r.l., Via V.Pisani 13, Milano (Italy).
119. Testa, B.; Kier, L. B. The Concept of Molecular Structure in StructureActivity Relationship Studies and Drug Design. Med. Res.
Rev. 1991, 11, 3548.
120. Jurs, P. C.; Dixon, J. S.; Egolf, L. M. Representations of Molecules. In Chemometrics Methods in Molecular Design; van de
Waterbeemd, H., Ed.; VCH Publishers: New York, NY, 1995; Vol. 2, pp 1538.
121. Smith, E. G.; Baker, P. A. The Wiswesser Line-Formula Chemical Notation (WLN); Chemical Information Management: Cherry
Hill, NJ, 1975.
122. Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules.
J. Chem. Inf. Comput. Sci. 1988, 28, 3136.
123. Mekenyan, O.; Ivanov, J.; Veith, G. D.; Bradbury, S. P. Dynamic QSAR: A New Search for Active Conformations and Significant
Stereoelectronic Indices. Quant. Struct. -Act. Relat. 1994, 13, 302307.
124. Mekenyan, O.; Nikolova, N.; Schmieder, P. Dynamic 3D QSAR Techniques: Applications in Toxicology. J. Mol. Struct.
(Theochem) 2003, 622, 147165.
125. Basak, S. C.; Gute, B. D.; Grunwald, G. D. Use of Topostructural, Topochemical, and Geometric Parameters in the Prediction of
Vapor Pressure: A Hierarchical QSAR Approach. J. Chem. Inf. Comput. Sci. 1997, 37, 651655.
126. Hosoya, H. Topological Index. A Newly Proposed Quantity Characterizing the Topological Nature of Structural Isomers of
Saturated Hydrocarbons. Bull. Chem. Soc. Jpn. 1971, 44, 23322339.
127. Randic, M.; Wilkins, C. L. Graph Theoretical Ordering of Structures as a Basis for Systematic Searches for Regularities in
Molecular Data. J. Phys. Chem. 1979, 83, 15251540.
128. Kier, L. B. A Shape Index from Molecular Graphs. Quant. Struct. -Act. Relat. 1985, 4, 109116.
Chemometrics in QSAR 167
129. Randic, M. Novel Shape Descriptors for Molecular Graphs. J. Chem. Inf. Comput. Sci. 2001, 41, 607613.
130. Wiener, H. Structural Determination of Paraffin Boiling Points. J. Am. Chem. Soc. 1947, 69, 1720.
131. Ivanciuc, O.; Balaban, A. T. The Graph Description of Chemical Structures. In Topological Indices and Related Descriptors in
QSAR and QSPR; Devillers, J., Balaban, A. T., Eds.; Gordon and Breach Science Publishers: Amsterdam, The Netherlands,
1999; pp 59167.
132. Ivanciuc, O.; Balaban, T.-S.; Balaban, A. T. Design of Topological Indices. Part 4. Reciprocal Distance Matrix, Related Local
Vertex Invariants and Topological Indices. J. Math. Chem. 1993, 12, 309318.
133. Janez ic, D.; Milicevic, A.; Nikolic, S.; Trinajstic, N. Graph Theoretical Matrices in Chemistry; University of Kragujevac:
Kragujevac, Serbia, 2007; 205 pp.
134. Randic, M. Graph Theoretical Approach to Local and Overall Aromaticity of Benzenoid Hydrocarbons. Tetrahedron 1975, 31,
14771481.
135. Kier, L. B.; Hall, L. H. The Nature of StructureActivity Relationships and Their Relation to Molecular Connectivity. Eur. J. Med.
Chem. 1977, 12, 307312.
136. Balaban, A. T. Highly Discriminating Distance-Based Topological Index. Chem. Phys. Lett. 1982, 89, 399404.
137. Burden, F. R. A Chemically Intuitive Molecular Index Based on the Eigenvalues of a Modified Adjacency Matrix. Quant. Struct.
Act. Relat. 1997, 16, 309314.
138. Raevsky, O. A.; Trepalin, S. V.; Razdolskii, A. N. New QSAR Descriptors Calculated from Interatomic Interaction Spectra.
Pharm. Chem. J. 2000, 34, 646649.
139. Robinson, D. D.; Winn, P. J.; Lyne, P. D.; Richards, W. G. Self-Organizing Molecular Field Analysis: A Tool for StructureActivity
Studies. J. Med. Chem. 1999, 42, 573583.
140. Buolamwini, J. K.; Assefa, H. CoMFA and CoMSIA 3D QSAR and Docking Studies on Conformationally-Restrained Cinnamoyl
HIV-1 Integrase Inhibitors: Exploration of a Binding Mode at the Active Site. J. Med. Chem. 2002, 45, 841852.
141. Xu, M.; Zhang, A.; Han, S.; Wang, L.-S. Studies of 3D-Quantitative StructureActivity Relationships on a Set of Nitroaromatic
Compounds: CoMFA, Advanced CoMFA and CoMSIA. Chemosphere 2002, 48, 707715.
142. Jolliffe, I. T. Discarding Variables in a Principal Component Analysis. I. Artificial Data. Appl. Stat. 1972, 21, 160173.
143. Jolliffe, I. T. Discarding Variables in a Principal Component Analysis. II. Real Data. Appl. Stat. 1973, 22, 2131.
144. Todeschini, R. Data Correlation, Number of Significant Principal Components and Shape of Molecules. The K Correlation Index.
Anal. Chim. Acta 1997, 348, 419430.
145. Todeschini, R.; Consonni, V.; Maiocchi, A. The K Correlation Index: Theory Development and Its Applications in Chemometrics.
Chemom. Intell. Lab. Syst. 1998, 46, 1329.
146. Efroymson, M. A. Multiple Regression Analysis. In Mathematical Methods for Digital Computers; Ralston, A., Wilf, H. S., Eds.;
Wiley: New York, NY, 1960.
147. Leardi, R. Application of Genetic Algorithms to Feature Selection under Full Validation Conditions and to Outlier Detection.
J. Chemom. 1994, 8, 6579.
148. Luke, B. T. Evolutionary Programming Applied to the Development of Quantitative StructureActivity Relationships and
Quantitative StructureProperty Relationships. J. Chem. Inf. Comput. Sci. 1994, 34, 12791287.
149. Zheng, W.; Tropsha, A. Novel Variable Selection Quantitative StructureProperty Relationship Approach Based on the
k-Nearest-Neighbor Principle. J. Chem. Inf. Comput. Sci. 2000, 40, 185194.
150. Baumann, K.; Albert, H.; von Korff, M. A Systematic Evaluation of the Benefits and Hazards of Variable Selection in Latent
Variable Regression. Part I. Search Algorithm, Theory and Simulations. J. Chemom. 2002, 16, 339350.
151. Kubinyi, H. Variable Selection in QSAR Studies. I. An Evolutionary Algorithm. Quant. Struct. -Act. Relat. 1994, 13,
285294.
152. Agrafiotis, D. K.; Ceden o, W.; Lobanov, V. S. On the Use of Neural Network Ensembles in QSAR and QSPR. J. Chem. Inf.
Comput. Sci. 2002, 42, 903911.
153. Ceden o, W.; Agrafiotis, D. K. Using Particle Swarms for the Development of QSAR Models Based on K-Nearest Neighbor and
Kernel Regression. J. Comput. Aided Mol. Des. 2003, 17, 255263.
154. Lin, Z. H.; Xingguo, C.; Zhide, H. A New Approach for the Identification of Important Variables. Chemom. Intell. Lab. Syst. 2006,
80, 130135.
155. Lindgren, F.; Geladi, P.; Ra nnar, S.; Wold, S. Interactive Variable Selection (IVS) for PLS. Part I: Theory and Algorithms.
J. Chemom. 1994, 8, 349363.
156. Lindgren, F.; Geladi, P.; Berglund, A.; Sjo stro m, M.; Wold, S. Interactive Variable Selection (IVS) for PLS. Part II: Chemical
Applications. J. Chemom. 1995, 9, 331342.
157. Centner, V.; Massart, D. L.; de Noord, O. E.; De Jong, S.; Vandeginste, B. G. M.; Sterna, C. Elimination of Uniformative Variables
for Multivariate Calibration. Anal. Chem. 1996, 68, 38513858.
158. Sutter, J. M.; Peterson, T. A.; Jurs, P. C. Prediction of Gas Chromatographic Retention Indices of Alkylbenzene. Anal. Chim.
Acta 1997, 342, 113122.
159. Akaike, H. A New Look at the Statistical Model Identification. IEEE Trans. Automat. Contr. 1974, AC-19, 716723.
160. Friedman, J. H. Multivariate Adaptive Regression Splines; Report; Laboratory of Computational Statistics Department of
Statistics: Stanford, CA.
161. Kubinyi, H. Evolutionary Variable Selection in Regression and PLS Analyses. J. Chemom. 1996, 10, 119133.
162. Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M. Detecting Bad Regression Models: Multicriteria Fitness Functions in
Regression Analysis. Anal. Chim. Acta 2004, 515, 199208.
163. Golbraikh, A.; Tropsha, A. Beware of q
2!
. J. Mol. Graph. Model. 2002, 20, 269276.
164. Tropsha, A.; Gramatica, P.; Gombar, V. K. The Importance of Being Earnest: Validation Is the Absolute Essential for Successful
Application and Interpretation of QSPR Models. QSAR Comb. Sci. 2003, 22, 6977.
165. Sutherland, J. J.; Weaver, D. F. Development of Quantitative StructureActivity Relationships and Classification Models for
Anticonvulsant Activity of Hydantoin Analogues. J. Chem. Inf. Comput. Sci. 2003, 43, 10281036.
166. van Rhee, A. M. Use of Recursion Forest in the Sequential Screening Process: Consensus Selection by Multiple Recursion
Trees. J. Chem. Inf. Model. 2003, 43, 941948.
168 Chemometrics in QSAR
167. Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M. MOBYDIGS: Software for Regression and Classification Models by Genetic
Algorithms. In Chemometrics: Genetic Algorithms and Artificial Neural Networks; Leardi, R., Ed.; Elsevier: Amsterdam, The
Netherlands, 2003; pp 141167.
168. Todeschini, R.; Consonni, V.; Pavan, M. A Distance Measure between Models: A Tool for Similarity/Diversity Analsysis of Model
Populations. Chemom. Intell. Lab. Syst. 2004, 70, 5561.
169. Gramatica, P.; Pilutti, P.; Papa, E. Validated QSAR Prediction of OH Tropospheric Degradation of VOCs: Splitting into Training-
Test Sets and Consensus Modeling. J. Chem. Inf. Comput. Sci. 2004, 44, 17941802.
170. Asikainen, A. H.; Ruuskanen, J.; Tuppurainen, K. A. Consensus kNN QSAR: A Versatile Method for Predicting the Estrogenic
Activity of Organic Compounds In Silico. A Comparative Study with Five Estrogen Receptors and a Large, Diverse Set of
Ligands. Environ. Sci. Technol. 2004, 38, 67246729.
171. Baurin, N.; Mozziconacci, J. C.; Arnoult, E.; Chavatte, P.; Marot, C.; Morin-Allory, L. 2D QSAR Consensus Prediction for High-
Throughput Virtual Screening. An Application to COX-2 Inhibition Modeling and Screening of the NCI Database. J. Chem. Inf.
Comput. Sci. 2004, 44, 276285.
172. Gramatica, P.; Giani, E.; Papa, E. Statistical External Validation and Consensus Modeling: A QSPR Case Study for K
oc
Prediction. J. Mol. Graph. Model. 2007, 25, 755766.
173. Votano, J. R.; Parham, M.; Hall, L. H.; Kier, L. B.; Oloff, S.; Tropsha, A.; Xie, Q.; Tong, W. Three New Consensus QSAR Models
for the Prediction of Ames Genotoxicity. Mutagenesis 2004, 19, 365377.
174. Eriksson, L.; Jaworska, J. S.; Worth, A. P.; Cronin, M. T. D.; McDowell, R. M.; Gramatica, P. Methods for Reliability, Uncertainty
Assessment, and Applicability Evaluations of Regression Based and Classification QSARs. Environ. Health Perspect. 2003, 111,
13611375.
175. Zefirov, N. S.; Palyulin, V. A. QSAR for Boiling Points of Small Sulfides. Are the High-Quality Structure-Property-Activity
Regressions the Real High Quality QSAR Models? J. Chem. Inf. Comput. Sci. 2001, 41, 10221027.
176.. Jaworska, J. S.; Nikolova-Jeliazkova, N.; Aldenberg, T. Review of Methods for Applicability Domain Estimation; Report; The
European Commission Joint Research Centre: Ispra, Italy.
177. Dimitrov, S.; Dimitrova, G.; Pavlov, T.; Dimitrova, N.; Patlewicz, G.; Niemela, J.; Mekenyan, O. A Stepwise Approach for Defining
the Applicability Domain of SAR and QSAR Models. J. Chem. Inf. Model. 2005, 45, 839849.
178. Jaworska, J. S.; Nikolova-Jeliazkova, N.; Aldenberg, T. QSAR Applicability Domain Estimation by Projection of the Training Set
in Descriptor Space: A Review. ATLA 2005, 33, 445459.
179. Netzeva, T. I.; Worth, A. P.; Aldenberg, T.; Benigni, R.; Cronin, M. T. D.; Gramatica, P.; Jaworska, J. S.; Kahn, S.; Klopman, G.;
Marchant, C. A.; Myatt, G.; Nikolova-Jeliazkova, N.; Patlewicz, G.; Perkins, R.; Roberts, D. W.; Schultz, T. W.; Stanton, D. T.; van
de Sandt, J. J. M.; Tong, W. D.; Veith, G. D.; Yang, C. H. Current Status of Methods for Defining the Applicability Domain of
(Quantitative) StructureActivity Relationships. ATLA 2005, 33, 155173.
180. Nikolova-Jeliazkova, N.; Jaworska, J. S. An Approach to Determining Applicability Domains for QSAR Group Contribution
Models: An Analysis of SRC KOWWIN. ATLA 2005, 33, 461470.
181. Tetko, I. V.; Bruneau, P.; Mewes, H.-W.; Rohrer, D. C.; Poda, G. I. Can We Estimate the Accuracy of ADME-Tox Predictions?
Drug Discov. Today 2006, 11, 700707.
182. Gramatica, P. Principles of QSAR Models Validation: Internal and External. QSAR Comb. Sci. 2007, 26, 694701.
183. Papa, E.; Villa, F.; Gramatica, P. Statistically Validated QSARs, Based on Theoretical Descriptors, for Modeling Aquatic Toxicity
of Organic Chemicals in Pimephales promelas (Fathead Minnow). J. Chem. Inf. Model. 2005, 45, 12561266.
184. Nikolova, N.; Jaworska, J. S. Approaches to Measure Chemical Similarity A Review. QSAR Comb. Sci. 2003, 22, 10061026.
185. Efron, B. The Jackknife, the Bootstrap and Other Resampling Planes; Society for Industrial and Applied Mathematics:
Philadelphia, PA, 92 pp.
186. Cramer, R. D. III; Bunce, J. D.; Patterson, D. E.; Frank, I. E. Crossvalidation, Bootstrapping and Partial Least Squares Compared
with Multiple Regression in Conventional QSAR Studies. Quant. Struct.-Act. Relat. 1988, 7, 18–25.
187. Wold, S. Validation of QSARs. Quant. Struct.-Act. Relat. 1991, 10, 191–193.
188. Wold, S.; Eriksson, L. Statistical Validation of QSAR Results. Validation Tools. In Chemometrics Methods in Molecular Design;
van de Waterbeemd, H., Ed.; VCH Publishers: Weinheim, Germany, 1995; Vol. 2, pp 309–318.
189. Burden, F. R.; Brereton, R. G.; Walsh, P. T. A Comparison of Cross-Validation and Non-Cross-Validation Techniques:
Application to Polycyclic Aromatic Hydrocarbons Electronic Absorption Spectra. Analyst 1997, 122, 1015–1022.
190. Golbraikh, A.; Shen, M.; Xiao, Z.; Xiao, Y.-D.; Lee, K.-H.; Tropsha, A. Rational Selection of Training and Test Sets for the
Development of Validated QSAR Models. J. Comput. Aided Mol. Des. 2003, 17, 241–253.
191. Baumann, K. Cross-Validation as the Objective Function for Variable-Selection Techniques. Trends Analyt. Chem. 2003, 22,
395–406.
192. Lanteri, S. Full Validation Procedures for Feature Selection in Classification and Regression Problems. Chemom. Intell. Lab.
Syst. 1992, 15, 159–169.
193. Stone, M. Cross-Validatory Choice and Assessment of Statistical Predictors. J. R. Stat. Soc. 1974, B 36, 111–147.
194. Wold, S. Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models.
Technometrics 1978, 20, 397–405.
195. Osten, D. W. Selection of Optimal Regression Models via Cross-Validation. J. Chemom. 1988, 2, 39.
196. Miller, A. J. Subset Selection in Regression; Chapman & Hall: London, UK, 1990; 230 pp.
197. Efron, B. Better Bootstrap Confidence Intervals. J. Am. Stat. Assoc. 1987, 82, 171–200.
198. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 26, 123–140.
199. Clark, R. D. Boosted Leave-Many-Out Cross-Validation: The Effect of Training and Test Set Diversity on PLS Statistics.
J. Comput. Aided Mol. Des. 2003, 17, 265–275.
200. Guha, R.; Serra, J. R.; Jurs, P. C. Generation of QSAR Sets with a Self-Organizing Map. J. Mol. Graph. Model. 2004, 23, 1–14.
201. Snarey, M.; Terrett, N. K.; Willett, P.; Wilton, D. J. Comparison of Algorithms for Dissimilarity-Based Compound Selection.
J. Mol. Graph. Model. 1997, 15, 372–385.
202. Golbraikh, A.; Tropsha, A. Predictive QSAR Modeling Based on Diversity Sampling of Experimental Datasets for the Training
and Test Set Selection. Mol. Divers. 2002, 5, 231–243.
203. Wu, W.; Walczak, B.; Massart, D. L.; Heuerding, S.; Erni, F.; Last, I. R.; Prebble, K. A. Artificial Neural Networks in Classification
of NIR Spectral Data: Design of the Training Set. Chemom. Intell. Lab. Syst. 1996, 33, 35–46.
204. Gramatica, P.; Pilutti, P.; Papa, E. Approaches for Externally Validated QSAR Modelling of Nitrated Polycyclic Aromatic
Hydrocarbon Mutagenicity. SAR QSAR Environ. Res. 2007, 18, 169–178.
205. Clark, M.; Cramer, R. D. III. The Probability of Chance Correlation Using Partial Least Squares (PLS). Quant. Struct.-Act. Relat.
1993, 12, 137–145.
206. Baumann, K.; Stiefl, N. Validation Tools for Variable Subset Regression. J. Comput. Aided Mol. Des. 2004, 18, 549–562.
207. Nicholls, A.; MacCuish, N. E.; MacCuish, J. D. Variable Selection and Model Validation of 2D and 3D Molecular Descriptors.
J. Comput. Aided Mol. Des. 2004, 18, 451–474.
208. Lindgren, F.; Hansen, B.; Karcher, W.; Sjöström, M.; Eriksson, L. Model Validation by Permutation Tests: Applications to
Variable Selection. J. Chemom. 1996, 10, 521–532.
209. Clark, R. D.; Fox, P. C. Statistical Variation in Progressive Scrambling. J. Comput. Aided Mol. Des. 2004, 18, 563–576.
210. Rücker, C.; Rücker, G.; Meringer, M. y-Randomization and Its Variants in QSPR/QSAR. J. Chem. Inf. Model. 2007, 47,
2345–2357.
211. Livingstone, D. J. The Characterization of Chemical Structures Using Molecular Properties. A Survey. J. Chem. Inf. Comput. Sci.
2000, 40, 195–209.
Biographical Sketches
Roberto Todeschini is full professor of chemometrics at the Department of Environmental
Sciences of the University of Milano-Bicocca (Milano, Italy), where he founded the
Milano Chemometrics and QSAR Research Group. His main research activities concern
chemometrics in all its aspects, QSAR, molecular descriptors, multicriteria decision making,
and software development. President of the International Academy of Mathematical
Chemistry, President of the Italian Chemometric Society, and ad honorem professor of
the University of Azuay (Cuenca, Ecuador), he is the author of more than 150 publications in
international journals and of the books The Data Analysis Handbook, by I. E. Frank and
R. Todeschini; Elsevier, 1994, and Handbook of Molecular Descriptors, by R. Todeschini and
V. Consonni; Wiley-VCH, 2000.
Viviana Consonni received her Ph.D. in Chemical Sciences from the University of Milano in 2000
and is now full researcher of chemometrics at the Department of Environmental Sciences of
the University of Milano-Bicocca (Milano, Italy). She is a member of the Milano
Chemometrics and QSAR Research Group and has 10 years' experience in multivariate
analysis, QSAR, molecular descriptors, multicriteria decision making, and software
development. She is the author of more than 25 publications in peer-reviewed journals and of the book
Handbook of Molecular Descriptors, by R. Todeschini and V. Consonni; Wiley-VCH, 2000. She
received an award for distinguished young researchers from the International Academy of
Mathematical Chemistry in 2006.
Paola Gramatica is full professor of Environmental Chemistry, and former associate professor of
Organic Chemistry, at the University of Insubria (Varese, Italy). Since 1995 she has headed the
QSAR Research Unit in Environmental Chemistry and Ecotoxicology at the
Department of Structural and Functional Biology (DBSF), which she now directs. Her
current research concerns QSAR modeling and the application of chemometric methods to
environmental organic pollutants. Recent studies deal with tropospheric oxidations of
VOC, POP persistence, pesticide partition properties, PAH mutagenicity, BCF, and
endocrine disruptor (ED) modeling. Her main interests are persistent,
bioaccumulative, and toxic (PBT) chemicals and the validation of QSAR models. She is the
author of more than 100 papers in international journals (more than 60 in the QSAR field) and
about 200 presentations at meetings. She is a member of the Managing Boards of the
Environmental and Cultural Heritage Division of the Italian Chemical Society (SCI) and of the
SCI Interdivisional Group of Green Chemistry, and also a member of the OECD Expert
Group on QSARs.