Thesis

Literature thesis
Virtual screening of
Cytochrome P450 ligands:
Challenges and considerations
Andrianopsyah Mas Jaya Putra
Supervisors:
dr. Daan P. Geerke
Prof. dr. Nico P.E. Vermeulen
Department of Chemistry & Pharmaceutical Sciences

Faculty of Sciences - Vrije Universiteit, the Netherlands
July 2010
Contents
Abstract
1 Introduction
1.1 The importance of virtual screening of CYP450 ligands
1.2 Characteristics of CYP450s and their substrates
1.3 Validation of virtual screening model
1.4 Limitations of this thesis
2 Docking of CYP450 ligands
2.1 Effect of CYP450 structure on docking of CYP450 ligands
2.2 Effect of water molecules in CYP450's active site on docking of CYP450 ligands
2.3 Effect of ligand restraining on docking of CYP450 ligands
2.4 The issue of scoring function
2.5 Summary
3 Shape-matching, pharmacophore-matching, and field calculation of CYP450 ligands
3.1 Shape-matching of CYP450 ligands
3.2 Pharmacophore-matching of CYP450 ligands
3.3 Field calculation of CYP450 ligands
3.4 Summary
4 QSAR and classification of CYP450 ligands
4.1 QSAR of CYP450 ligands
4.2 Classification of CYP450 ligands by machine learning
4.2.1 Classification of CYP450 ligands by Support Vector Machine (SVM)
4.2.2 Classification of CYP450 ligands by decision tree
4.3 Summary
5 Conclusions and perspectives
Acknowledgments
References
Abstract
CYP450s (Cytochrome P450s) are liver enzymes involved in Phase I metabolism.
Binding of drugs to a CYP450 can lead to formation of reactive metabolites, or to CYP450
inhibition and drug-drug interactions. For these reasons, it is necessary to predict possible
interactions with CYP450s at an early stage of drug development in order to reduce attrition risks.
However, in-vitro and in-vivo experiments to test CYP450 affinity and/or activity for large sets of
drug candidates can be laborious. Alternatively, computational models can be used to predict the
CYP450 affinities and/or activities of the compounds. This prediction then serves as a guide for
in-vitro screening, so that the screening can be performed more efficiently. The computational
approach to predicting a compound's affinity / activity for a particular target is known as virtual
screening. Examples of virtual screening techniques are docking, shape-matching,
pharmacophore-matching, field calculation, QSAR, and machine learning. In turn, these techniques
can be classified into protein-based techniques (if the models are generated using a protein
structure) and ligand-based techniques (if the models are generated from structures or chemical
properties of active ligands). This thesis aims to present an overview of the challenges and
considerations in the application of these techniques to CYP450 ligands in the last five years, in
relation to the chemical nature of the relevant CYP450 isoforms (e.g. flexibility of CYP450) and
their ligands.
Keywords: Cytochrome P450, virtual screening, docking, shape-matching, pharmacophore, field, QSAR, machine learning
1 Introduction
1.1. The importance of virtual screening of CYP450 ligands
CYP450s (Cytochrome P450s) are a family of human enzymes, some of which are involved
in Phase I metabolism in the liver (Rock et al., 2008). They contribute to around 75% of the
metabolism of the top 200 drugs prescribed in the U.S. in 2002 (Williams et al., 2004). The
CYP450s that contribute most are CYP450 3A4 (CYP3A4), CYP450 2C9 (CYP2C9), CYP450 2C19
(CYP2C19), CYP450 2D6 (CYP2D6), and CYP450 1A2 (CYP1A2) (Williams et al., 2004).
A CYP450 transforms its substrate into a more polar one, in order to ease its excretion from
the body (Boelsterli, 2009). This transformation is facilitated by an Fe atom which is bound to a
heme cofactor inside the catalytic pocket of the CYP450 (Figure 1.1). Figure 1.2 describes a typical
example of this transformation.
Figure 1.1. An example of a CYP450 active site structure (1.95 Å crystal structure of CYP1A2;
PDB file: 2HI4). Blue sticks represent amino acids, and blue ribbons represent the backbone.
Reddish sticks represent the heme, and the orange ball at its center represents the Fe atom.
Above the heme, there is a BHF (2-phenyl-4H-benzo[h]chromen-4-one) ligand, depicted as yellow
sticks. The lone red ball represents a crystal water molecule. (Sansen et al., 2007)
Figure 1.2. An example of a catalytic transformation by a CYP450 (Rock et al., 2008). The nitrogens
are parts of the heme that bind Fe (not fully drawn). “Cys” refers to the cysteine of the CYP450,
below the heme. The step that is relevant to this thesis is marked by a blue dashed border.
An orally administered drug could bind to a CYP450 and be transformed into a reactive
metabolite, which is potentially hazardous (Boelsterli, 2009). Alternatively, it could act as a
CYP450 inhibitor, leading to a drug-drug interaction (Boelsterli, 2009). A drug-drug interaction
could also be caused by the genetically determined absence or inactivation of a CYP450 (referred to
as genetic polymorphism) (Ingelman-Sundberg et al., 1999). Drug-drug interactions through CYP450s
have caused the withdrawal of several drugs (Lin et al., 1998). For these reasons, it is necessary
to know at an early stage of drug development whether the drug would bind to a CYP450, so that a
decision can be made on whether or not to continue its development. In Figure 1.2, this challenge is
related to step (i). The challenge is addressed by testing the drug on CYP450s in-vitro (Zlokarnik
et al., 2005). However, this method is laborious for a large set of compounds.
Alternatively, the in-vitro test can be approximated by a computational model that
predicts the affinities / activities of the compounds for a CYP450. This prediction then serves as
a guide for the in-vitro screening, so that the in-vitro screening can be performed more
efficiently. Table 1.1 compares the typical costs of computer modeling and of other
experiments.
Table 1.1. Typical costs of various experiments
in drug discovery and development (Young, 2009)
The computational approach to predicting a compound's affinity / activity for a particular target
is known as virtual screening. Examples of virtual screening techniques are presented in Table 1.2.
These techniques can be classified into two groups based on their models. If their models are
generated using a protein structure (X-ray, NMR, or homology model), they are called structure-based
techniques (in this thesis, they are termed protein-based techniques for clarity) (Good, 2006). If
their models are generated from the structures of active ligands, they are called ligand-based
techniques (Good, 2006).
Table 1.2. Classification of virtual screening techniques
Protein-based techniques    Protein- or ligand-based techniques    Ligand-based techniques
Docking                     Shape-matching                         QSAR
                            Pharmacophore-matching                 Machine learning
                            Field calculation
Since a virtual screening model is only an approximation of real screening, it is imperfect.
However, the model can be improved by recognizing the challenges in the approximations, and it can
still be useful when it is applied with some considerations. In line with this idea, this thesis
aims to present an overview of the challenges and considerations in the application of the virtual
screening techniques to CYP450 ligands in the last 5 years.
1.2. Characteristics of CYP450s and their substrates
Table 1.3 lists the crystal structures of human CYP450s that are available so far: the CYP450 1A2,
2C9, 2D6, and 3A4 structures (Stjernschantz et al., 2008). The evolutionary relationships between
these CYP450s are described by the tree in Figure 1.3.
Table 1.3. Available crystal structures of human CYP450s (Stjernschantz et al., 2008)
Figure 1.3. Evolutionary relationships between several human CYP450s
(Ingelman-Sundberg et al., 1999)
From Figure 1.3, it can be seen that the four CYP450s are only distantly related to each other.
The differences between them and between their substrates are summarized in Table 1.4.
Table 1.4. Characteristics of several CYP450s and their substrates in general
(Lewis et al., 2002. Arimoto, 2006. de Groot et al., 2006.)
CYP    Relative volume of active site    General characteristics of substrates
1A2    Small                             Planar, lipophilic, neutral or basic
2C9    Medium                            Acidic
2D6    Medium                            Lipophilic, neutral or basic
3A4    Large                             Globular, lipophilic, neutral or basic
As shown in the table, the CYP450s differ in their active site volumes and also
in their substrate characteristics. Since the active site of a CYP450 is taken into account in
protein-based virtual screening, its volume should have an impact on protein-based virtual
screening, as described in the following chapters. Meanwhile, CYP450 substrate characteristics can
be exploited for ligand-based virtual screening.
1.3. Validation of virtual screening model
A virtual screening model should be validated to see whether it gives correct predictions of
compound affinities / activities. For this validation, a training dataset is provided, which
consists of a small number of substrates or inhibitors with known affinity / activity data (termed
actives). A large number of non-substrates or non-inhibitors (termed inactives) is added to this
dataset. The model should retrieve as many actives and as few inactives from the dataset as
possible. The results of this training are used to improve the model, and the training is repeated
until a final model is obtained. The final model is then tested on a test dataset, which has a
similar composition but different members (Triballeau et al., 2006).
Results from this validation are classified into four groups: true positives (TP), false
positives (FP), true negatives (TN), and false negatives (FN) (Kirchmair et al., 2008) (Figure 1.4).
True positives are actives which are retrieved by the model. False positives are inactives which are
retrieved by the model. True negatives are inactives which are not retrieved by the model, while
false negatives are actives which are not retrieved by the model. The set of compounds retrieved
(selected) by a model therefore consists of true and false positives (TP + FP).
Figure 1.4. Validation of a virtual screening model (Kirchmair et al., 2008).
FN = False Negatives. FP = False Positives. TN = True Negatives. TP = True Positives.
Using these groups, the quality of a model is then expressed with a metric. The most
popular metrics are Sensitivity, Positive Precision, and Enrichment (Triballeau et al., 2006).
Sensitivity equals TP/(TP+FN). It measures how well the model can retrieve the active
ligands in the training / test dataset. Positive Precision (sometimes called Hit Rate) equals
TP/(TP+FP) (for the rest of this thesis, it will be called “Precision” only). It is the fraction of
the retrieved compounds that are active, and so measures how well the model can retrieve active
ligands from the training / test dataset in the presence of inactive ligands. Therefore, it reflects
the selectivity of the model. Enrichment equals (TP/(TP+FP))/((TP+FN)/(TP+FP+TN+FN)), i.e.
Precision divided by the fraction of actives in the whole dataset. It indicates how many times
better the model is than a random selection at retrieving actives. Throughout this thesis, these
three metrics will be mentioned frequently.
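As a worked illustration of the three metrics defined above, the following minimal Python sketch computes them from the four counts. The class name, the function names, and the example counts (a hypothetical dataset of 1,000 compounds that contains 50 actives) are assumptions made for illustration only and are not taken from the cited studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    """Outcome counts of a virtual screening validation (hypothetical container)."""
    tp: int  # actives retrieved by the model
    fp: int  # inactives retrieved by the model
    tn: int  # inactives not retrieved by the model
    fn: int  # actives not retrieved by the model


def sensitivity(c: ScreeningCounts) -> float:
    """Fraction of all actives that the model retrieves: TP/(TP+FN)."""
    return c.tp / (c.tp + c.fn)


def precision(c: ScreeningCounts) -> float:
    """Fraction of the retrieved compounds that are active: TP/(TP+FP)."""
    return c.tp / (c.tp + c.fp)


def enrichment(c: ScreeningCounts) -> float:
    """Precision divided by the fraction of actives in the whole dataset."""
    total = c.tp + c.fp + c.tn + c.fn
    return precision(c) / ((c.tp + c.fn) / total)


# Illustrative counts only: 1,000 compounds, 50 actives, 100 compounds retrieved.
example = ScreeningCounts(tp=30, fp=70, tn=880, fn=20)
print(f"Sensitivity: {sensitivity(example):.2f}")
print(f"Precision:   {precision(example):.2f}")
print(f"Enrichment:  {enrichment(example):.1f}")
```

With these counts, the model retrieves 60% of the actives, 30% of its selection is active, and it performs six times better than a random selection, which would be expected to contain 5% actives.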
The quality of a virtual screening model is determined by the training and test datasets.
Therefore, the affinity / activity data of both datasets should be consistent. Li and colleagues
(2008) recommended that the ligands in the datasets should have been tested in a uniform way (with
the same assay procedure, in the same laboratory), since different assay procedures from different
laboratories would result in different affinity / activity data.
1.4. Limitations of this thesis
Before ending this chapter, the author would like to emphasize that this thesis is limited to
the prediction of CYP450 ligand affinity / activity. It is not about the prediction of the
metabolites of a CYP450 substrate, although it will touch on the prediction of the substrate's SOM
(site of metabolism). This thesis will also not discuss the issue of CYP450 allosteric binding sites.
2 Docking of CYP450 ligands
With the availability of CYP450 crystal structures (Table 1.3), the binding event between a
CYP450 and its ligand can be examined more carefully by molecular dynamics. In molecular
dynamics, the interaction of a protein with its ligand in water is simulated in a systematic
(time-connected) and flexible manner, so that relevant conformations and orientations of the protein
and the ligand are sampled adequately, after which the standard Gibbs binding energy of the ligand
(ΔG⁰bind) can be calculated (Leach, 2001). This energy is related to the affinity of the ligand (Ki)
through Equation 2.1:
$$\Delta G^{0}_{\mathrm{bind}} = 2.303\,RT\,\log K_i \qquad \text{(Equation 2.1)}$$
where R is the gas constant and T is the absolute temperature (Schneider et al., 2008). Therefore, the
calculated standard Gibbs binding energy of a CYP450 ligand can be used to predict its affinity.
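To give a feeling for the magnitudes involved in Equation 2.1, the following minimal Python sketch converts between Ki and the standard Gibbs binding energy. The function names, the temperature of 298.15 K, and the example Ki values are assumptions chosen for illustration. Ki is taken in mol/L, relative to a 1 M standard state.

```python
import math

GAS_CONSTANT_KJ = 8.314462618e-3  # gas constant R in kJ/(mol*K)


def binding_free_energy(ki_molar: float, temperature_k: float = 298.15) -> float:
    """Equation 2.1: standard Gibbs binding energy (kJ/mol) from Ki (mol/L)."""
    return 2.303 * GAS_CONSTANT_KJ * temperature_k * math.log10(ki_molar)


def inhibition_constant(delta_g_kj: float, temperature_k: float = 298.15) -> float:
    """Equation 2.1 rearranged: Ki (mol/L) from a binding energy (kJ/mol)."""
    return 10.0 ** (delta_g_kj / (2.303 * GAS_CONSTANT_KJ * temperature_k))


# Hypothetical millimolar, micromolar, and nanomolar ligands.
for ki in (1e-3, 1e-6, 1e-9):
    print(f"Ki = {ki:.0e} M  ->  dG0_bind = {binding_free_energy(ki):.1f} kJ/mol")
```

Under these assumptions, each tenfold decrease in Ki lowers the binding energy by about 5.7 kJ/mol, so a micromolar ligand corresponds to roughly -34 kJ/mol.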
The application of molecular dynamics to the sampling of CYP450 ligand conformations was
exemplified recently by Vasanthanathan and colleagues (2010). To calculate ΔGbind, they
employed an empirical method called LIE (Linear Interaction Energy) (for details of this method,
refer to Vasanthanathan et al., 2010). This method gave them a ΔGbind root mean square error of 3.7
kJ/mol for the 8 ligands in their training set. However, as they stated, molecular dynamics is still
computationally expensive. As an alternative, a simplified simulation technique called docking is
used for virtual screening.
In docking, the ligand's conformations and orientations (called poses) are sampled without time
connections (Leach, 2001). A scoring function, which estimates the sum of all interaction energies
between the protein and the ligand (ΔGbind), is evaluated for every pose (Leach, 2001). It is used
to rank the poses and ligands and to predict the relative affinities of the ligands. An example of a
scoring function is ChemScore, which is implemented in the GOLD docking software (Equation 2.2).
$$\Delta G_{\mathrm{bind}} = a + b\,S_{\mathrm{hbond}} + c\,S_{\mathrm{metal}} + d\,S_{\mathrm{lipo}} + e\,H_{\mathrm{rot}} \qquad \text{(Equation 2.2)}$$
In this equation, Shbond, Smetal, Slipo, and Hrot are, respectively, the total energy of hydrogen
bonds, the total energy of the ligand's interactions with metal (e.g. with the Fe atom of the heme
in the active site of a CYP450), the total energy of lipophilic interactions, and an additional
score that represents the loss of conformational entropy of the ligand upon binding. The
coefficients a to e are parameters which were obtained from a regression analysis of the energies
and score against the experimental ΔGbind for a training set of protein-ligand complexes (Verdonk
et al., 2003).
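The linear form of Equation 2.2 can be sketched in a few lines of Python. This is only an illustration of how such a weighted sum is evaluated and used to rank poses. It is not the GOLD implementation, the coefficient values are placeholders rather than the published ChemScore parameters, and the term values of the two poses are invented.

```python
from typing import Dict, List, NamedTuple


class PoseTerms(NamedTuple):
    """Interaction terms of one docking pose (names follow Equation 2.2)."""
    s_hbond: float
    s_metal: float
    s_lipo: float
    h_rot: float


class Coefficients(NamedTuple):
    """Regression parameters a to e of Equation 2.2."""
    a: float
    b: float
    c: float
    d: float
    e: float


def score_pose(terms: PoseTerms, coeff: Coefficients) -> float:
    """Estimated binding energy of a pose as a weighted sum of its terms."""
    return (coeff.a
            + coeff.b * terms.s_hbond
            + coeff.c * terms.s_metal
            + coeff.d * terms.s_lipo
            + coeff.e * terms.h_rot)


def rank_poses(poses: Dict[str, PoseTerms], coeff: Coefficients) -> List[str]:
    """Pose names ordered from the most to the least favorable score."""
    return sorted(poses, key=lambda name: score_pose(poses[name], coeff))


# Placeholder coefficients and invented term values, for illustration only.
PLACEHOLDER = Coefficients(a=-5.0, b=-3.0, c=-6.0, d=-0.1, e=2.5)
poses = {
    "pose_1": PoseTerms(s_hbond=1.0, s_metal=0.0, s_lipo=100.0, h_rot=1.0),
    "pose_2": PoseTerms(s_hbond=2.0, s_metal=1.0, s_lipo=80.0, h_rot=2.0),
}
for name in rank_poses(poses, PLACEHOLDER):
    print(name, round(score_pose(poses[name], PLACEHOLDER), 2))
```

In this sketch a more negative score is treated as more favorable, so the pose with the metal contact is ranked first. Changing the weights c or d changes the ranking, which is why the parameterization of these terms matters for CYP450 ligands.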
The challenge in docking is to obtain the correct docking pose(s) of a ligand, and to rank the
ligand correctly to afford a good prediction of its relative affinity. For a CYP450 substrate, a
docking pose is considered correct if the site of metabolism of the substrate is placed within 6 Å
(an arbitrary cutoff) of the Fe atom of the heme cofactor in the CYP450's active site, with respect
to the amino acids in the active site (Kirton et al., 2005). In the following sections, we will
discuss factors that influence the docking of CYP450 ligands.
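The distance criterion described above is simple to express in code. The following minimal Python sketch checks whether the site of metabolism (SOM) of a docked substrate lies within the 6 Å cutoff of the heme Fe atom and reports the percentage of correct poses. The function names and all coordinates are hypothetical. The sketch covers only the distance part of the criterion.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float, float]  # Cartesian coordinates in Å


def is_correct_pose(som: Point, fe: Point, cutoff: float = 6.0) -> bool:
    """True if the SOM atom lies within `cutoff` Å of the heme Fe atom."""
    return math.dist(som, fe) <= cutoff


def percent_correct(som_positions: Iterable[Point], fe: Point,
                    cutoff: float = 6.0) -> float:
    """Percentage of poses whose SOM satisfies the distance criterion."""
    positions = list(som_positions)
    hits = sum(is_correct_pose(p, fe, cutoff) for p in positions)
    return 100.0 * hits / len(positions)


# Invented coordinates: Fe at the origin and the SOM atoms of three poses.
fe_atom = (0.0, 0.0, 0.0)
som_atoms = [(3.0, 4.0, 0.0), (0.0, 0.0, 7.5), (2.0, 2.0, 2.0)]
print(f"{percent_correct(som_atoms, fe_atom):.1f}% of poses are correct")
```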
2.1. Effect of CYP450 structure on docking of CYP450 ligands
In 2008, Hritz and colleagues reported a docking study of CYP2D6 substrates using the apo
(ligand-free) crystal structure of CYP2D6 (PDB ID: 2F9Q). They realized that the crystal structure
was too tight to accommodate known CYP2D6 substrates. Therefore, they relaxed the structure by
adapting it to a known CYP2D6 substrate, (R)-propranolol, and performed a molecular dynamics
simulation of the complex, considering the thermal motion of Phe483 of CYP2D6. This motion
generally involved only small changes in CYP2D6's conformation. From the molecular dynamics
simulation, they extracted 250 conformations of CYP2D6. Then, they docked 65 CYP2D6 substrates
with known sites of metabolism into each of the conformations. They discovered that some of the
CYP2D6 conformations gave significantly higher percentages of correct docking poses than the others,
although the differences between those conformations were small (Figure 2.1). Based on this result,
they recommended that the CYP2D6 structure for a docking study should be selected carefully (in this
case, they recommended the CYP2D6 conformation that gave the highest percentage of correct docking
poses in Figure 2.1).
Figure 2.1. Different frames (conformations) of CYP2D6 gave different percentages of correct binding
poses of its substrates (Hritz et al., 2008). The highest percentage was given by frame 216 (marked by
the longest vertical line).
Previously, Polgar and colleagues (2007) performed a docking study of CYP2C9 ligands using
three crystal structures of CYP2C9 (PDB IDs: 1OG2, 1OG5 and 1R90). The 1OG2 structure is an
apo structure, the 1OG5 structure is in complex with S-warfarin, and the 1R90 structure is in complex
with flurbiprofen. Polgar and colleagues aligned the 1OG5 structure with the 1R90 structure by
sequence homology, then merged S-warfarin from the 1OG5 structure into the 1R90 structure. The latter
was customized to accommodate S-warfarin as well. Afterwards, they docked a dataset of 5,411
compounds (containing 42 CYP2C9 ligands) into the three CYP2C9 structures. They discovered that
the customized 1R90 structure gave higher enrichment at 1% of the dataset than the apo and 1OG5
structures (Figure 2.2). This result, together with that of Hritz et al. (2008), suggests that
the flexibility of a CYP450 should be taken into account in docking, and that a reasonable
customization of a CYP450 structure is sometimes necessary to obtain better docking results for its
ligands.
Figure 2.2. Enrichments from docking of 5,411 compounds (containing 42 CYP2C9 ligands) into the
1OG2 (apo), 1OG5, and 1R90 (customized) structures of CYP2C9 (Polgar et al., 2007). FlexX is the
docking software. Gold (Goldscore), PMF, and Chem (ChemScore) are the scoring functions.
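Enrichment "at 1% of the dataset" means that only the best-scoring 1% of the docked compounds is treated as the model's selection. The following minimal Python sketch shows one way to compute such a value from a ranked list. The function name, the convention that lower scores are better, and the synthetic dataset are assumptions made for illustration. The data are not those of Polgar et al. (2007).

```python
from typing import List, Tuple


def enrichment_at(ranked: List[Tuple[float, bool]], fraction: float) -> float:
    """Enrichment among the best-scoring `fraction` of a docked dataset.

    `ranked` holds (score, is_active) pairs. Lower (more negative) scores are
    assumed to be better, which is a convention chosen for this sketch.
    """
    ordered = sorted(ranked, key=lambda pair: pair[0])
    n_top = max(1, int(round(fraction * len(ordered))))
    actives_top = sum(is_active for _, is_active in ordered[:n_top])
    actives_all = sum(is_active for _, is_active in ordered)
    return (actives_top / n_top) / (actives_all / len(ordered))


# Synthetic example: 200 compounds, 10 of them active, 4 actives in the top 10.
active_ranks = {0, 2, 5, 9, 50, 60, 70, 80, 90, 100}
dataset = [(-200.0 + rank, rank in active_ranks) for rank in range(200)]
print(f"Enrichment at 5%: {enrichment_at(dataset, 0.05):.1f}")
```

In this synthetic case the top 5% contains 40% actives, against 5% in the whole dataset, so the selection is eight times richer in actives than a random one.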
2.2. Effect of water molecules in CYP450's active site on docking of CYP450 ligands
Under biological conditions, water molecules can be present in the active site of a CYP450. As
evidence, there are several water molecules trapped in the active site of the crystal structure of
CYP2D6 (Rowland et al., 2006). Water molecules influence the docking of a CYP450 ligand in one of
two ways (Santos et al., 2010). First, they prevent the ligand from occupying a region far from the
heme (Figure 2.3, left). Second, they make H-bonds with the ligand that orient the ligand into a
correct or an incorrect pose (Figure 2.3, right).
Santos and colleagues (2010) investigated the effect of including water molecules in the docking
of CYP2D6 substrates. For this purpose, they used the crystal structure of CYP2D6 (PDB ID: 2F9Q)
and generated 8 conformations from it. They used MDEA (R-3,4-methylenedioxy-N-ethylamphetamine)
to generate a set of the most favorable water positions within the active site of CYP2D6. Then, they
docked 11 MDEA-like substrates and 53 non-MDEA-like substrates into each CYP2D6 conformation.
They discovered that the inclusion of water molecules in the active site of 2F9Q improved the
percentage of correctly docked MDEA-like substrates, but not that of non-MDEA-like substrates
(Table 2.1). Based on this result, they recommended that water molecules should not be excluded
completely in docking.
However, they also recommended that a different set of water molecules should be used for each
class of substrates.
Figure 2.3. Water molecules (balls) influence the docking of a CYP450 ligand (wires) in one of two
ways: by preventing the ligand from occupying a region far from the heme (green sticks at the bottom)
(left), or by making H-bonds with the ligand (right) (Santos et al., 2010). Red wires represent the
ligand when water is excluded, while blue wires represent the ligand when water is included.
Table 2.1. Percentage of correctly docked MDEA-like and non-MDEA-like substrates for different
CYP2D6 conformations (Santos et al., 2010). “HOH OFF” means water is excluded, while “HOH
toggle” means water is included, but allowed to be temporarily displaced by a ligand.
2.3. Effect of ligand restraining on docking of CYP450 ligands
In docking, ligand conformational sampling is sometimes conducted in regions that are not
relevant to ligand binding, which makes the sampling inefficient. To improve the sampling
efficiency, the ligand can be restrained to interact with important amino acids only. This was
exemplified by Polgar and colleagues (2007) in the docking of CYP2C9 ligands. In CYP2C9, Arg108 is
thought to be crucial for binding (Ridderstrom et al., 2000. Dickmann et al., 2004). By restraining
CYP2C9 ligands to interact with this amino acid, Polgar and colleagues obtained a high enrichment at
1% of their dataset, compared to almost zero enrichment when the ligands were not restrained to the
amino acid (Figure 2.4).
Alternatively, ligand poses can be filtered for their interactions with important amino acids before
they are scored. Only poses which have such interactions are passed on to the scoring step. This
filtering step can be performed with interaction fingerprints, as described in the report of
Mpamhanga and colleagues (2006). However, there has been no report of the application of this method
to the docking poses of CYP450 ligands.
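The idea of filtering poses by interaction fingerprint can be illustrated with a small Python sketch. This is a generic, simplified bit-string filter and not the method of Mpamhanga et al. (2006). Arg108 is used as the key residue only because it is mentioned above for CYP2C9. The other residue names ("ResA", "ResB"), the pose names, and their contacts are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def interaction_fingerprint(contacts: Set[str],
                            residues: Sequence[str]) -> Tuple[int, ...]:
    """One bit per active-site residue: 1 if the pose interacts with it."""
    return tuple(int(residue in contacts) for residue in residues)


def passes_filter(fingerprint: Sequence[int], required: Sequence[int]) -> bool:
    """True if every bit set in `required` is also set in the fingerprint."""
    return all(bit == 1 for bit, needed in zip(fingerprint, required) if needed)


def filter_poses(poses: Dict[str, Set[str]], residues: Sequence[str],
                 key_residues: Set[str]) -> List[str]:
    """Names of the poses that interact with all key residues."""
    required = interaction_fingerprint(key_residues, residues)
    return [name for name, contacts in poses.items()
            if passes_filter(interaction_fingerprint(contacts, residues), required)]


# Hypothetical residue list and pose contacts, for illustration only.
site_residues = ["Arg108", "ResA", "ResB"]
docked_poses = {
    "pose_1": {"Arg108", "ResA"},
    "pose_2": {"ResA", "ResB"},
    "pose_3": {"Arg108"},
}
print(filter_poses(docked_poses, site_residues, {"Arg108"}))
```

Only the poses that pass the filter would then be handed to the scoring function, which reduces the number of poses that need to be scored.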
Figure 2.4. Enrichments from docking of 5,411 compounds (containing 42 CYP2C9 ligands) into the
1R90 crystal structure of CYP2C9, with restraining (black and red lines) and without restraining (blue
and green lines) to Arg108 of the CYP2C9 (Polgar et al., 2007). FlexX is the docking software and
scoring function. Chem (ChemScore) is also a scoring function. “NO_Arg108” means that ligands were
not restrained to interact with Arg108.
2.4. The issue of scoring function
In CYP450s, there is a lipophilic environment above the heme group, for which the lipophilic term
of some scoring functions (e.g. Slipo in Equation 2.2) performs poorly (Kirton et al., 2005).
Additionally, the ligand-Fe interaction energy (e.g. Smetal in Equation 2.2) in a scoring function
should be balanced with the other terms for CYP450 ligands which can coordinate directly to the heme
(Kirton et al., 2005). Reparameterization of these terms can improve the performance of the scoring
function, as exemplified by Kirton and colleagues (2005) for the ChemScore scoring function of the
GOLD docking program. Because the scores of CYP450 ligands determine their ranks in virtual
screening, an improvement of the scoring function should result in an improvement of the virtual
screening.
2.5. Summary
In summary, docking can be used to screen CYP450 ligands based on their Gibbs binding
energies (ΔGbind) when the CYP450 structure is available. The challenge in docking is to obtain the
correct docking pose(s) of a ligand, and to rank the ligand correctly to afford a good prediction of
its relative affinity. To obtain a correct docking pose of a CYP450 ligand, one should consider the
effects of the CYP450 structure, the inclusion of water molecules in the CYP450's active site, and
the restraining of the ligand to important CYP450 amino acids. To rank the ligand correctly, one
should reparameterize the scoring function to fit the chemical environment of CYP450 active sites.
3 Shape-matching, pharmacophore-matching, and field calculation of CYP450 ligands
The bound state of a ligand, in which it exerts its affinity / activity, can be represented by the
three-dimensional “lock and key” model. In this model, a ligand binds to its protein if its
functional groups and shape are complementary to the amino acids and shape of the protein's
active / binding site, respectively (Motiejunas et al., 2006). Figure 3.1 illustrates these
complementarities. The complementarities can be exploited for virtual screening. They serve as the
basis for several virtual screening techniques at the interface between protein-based and
ligand-based virtual screening (Table 1.2), namely shape-matching, pharmacophore-matching, and field
calculation (Good, 2006. Vistoli et al., 2006).
Figure 3.1. Left: CYP1A2 (blue) in complex with 2-phenyl-4H-benzo[h]chromen-4-one (yellow)
(Sansen et al., 2007). The phenylalanines (F226, F256, and F260) of CYP1A2 are complementary
to the nearby aromatic ring of the ligand. Reddish sticks represent the heme, and the orange ball at
the center of the heme represents the Fe atom. The lone red ball represents a water molecule. Right:
Shape complementarity between CYP1A2's active site (represented by a Gaussian surface) and the
same ligand (joined balls) (adapted from PDB file 2HI4 from Sansen et al., 2007). The renderings of
the heme, Fe atom, and water molecule are the same as in the left picture.
3.1. Shape-matching of CYP450 ligands
Shape-matching can be done in two ways: ligand-based and protein-based. In ligand-based
shape-matching, the shape of an active is used as a query (Ebalunode et al., 2010. Putta et al.,
2007). In protein-based shape-matching, by contrast, the shape of a protein's active / binding site
is abstracted to produce the so-called “negative image” of the active / binding site, and this
“negative image” is then used as the query (Ebalunode et al., 2010). By matching either of these two
queries with the shapes of the compounds which are going to be screened, predictions can be made
about the compounds' affinities / activities. If they match, the compounds might have affinities /
activities similar to those of the ligand (the Similarity Principle) (Schneider et al., 2008).
Figure 3.2 illustrates the ligand-based shape-matching algorithm which is implemented in
ROCS (ROCS 3.0.0 Manual, 2009), the most popular software for shape-matching at the moment
according to Laggner et al. (2008). In this algorithm, molecular shapes are represented by Gaussian
functions. ROCS overlays the Gaussians of a query with those of a molecule in the dataset on their
centers of mass, then optimizes the overlay to find the best match. The match is expressed with a
score (e.g. the ShapeTanimoto score (Equation 3.1)), which ranges from 0 (for the most dissimilar
ligand) to 1 (for an exact match).
Figure 3.2. Ligand-based shape-matching algorithm in ROCS (ROCS 3.0.0 Manual, 2009).
$$\text{ShapeTanimoto} = \frac{\text{Overlap Gaussians}}{\text{Gaussians of query} + \text{Gaussians of molecule} - \text{Overlap Gaussians}} \qquad \text{(Equation 3.1)}$$
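The behavior of the ShapeTanimoto score in Equation 3.1 can be illustrated with a simplified Python sketch. Here each atom is represented by one spherical Gaussian, and the overlap of two molecules is approximated by the sum of all pairwise Gaussian overlaps. This is not the ROCS implementation: the Gaussian width and amplitude are arbitrary placeholder values, the coordinates are invented, and the overlay is not optimized.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]  # atom center in Å


def gaussian_overlap(a: Point, b: Point, alpha: float = 0.8, p: float = 1.0) -> float:
    """Overlap integral of two spherical Gaussians p*exp(-alpha*r^2) at a and b."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return p * p * math.exp(-alpha * d2 / 2.0) * (math.pi / (2.0 * alpha)) ** 1.5


def overlap_volume(mol_a: List[Point], mol_b: List[Point]) -> float:
    """Sum of all pairwise atom-atom Gaussian overlaps (first-order approximation)."""
    return sum(gaussian_overlap(a, b) for a in mol_a for b in mol_b)


def shape_tanimoto(query: List[Point], molecule: List[Point]) -> float:
    """Equation 3.1 for a fixed (not optimized) overlay of query and molecule."""
    overlap = overlap_volume(query, molecule)
    return overlap / (overlap_volume(query, query)
                      + overlap_volume(molecule, molecule) - overlap)


# Invented two-atom "molecules" for illustration.
query = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in query]
print(f"identical: {shape_tanimoto(query, query):.2f}")
print(f"shifted:   {shape_tanimoto(query, shifted):.2f}")
```

An identical overlay gives a score of 1, and the score falls towards 0 as the two shapes are moved apart, which is why the overlay has to be optimized before molecules are compared.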
In ROCS, the match can also be calculated by considering only particular functional
groups of the molecules (represented by Mills Dean force fields) (Figure 3.3), using another score
called the ColorTanimoto score, which is similar to the ShapeTanimoto score. ROCS also provides a
combination of the ColorTanimoto score with the aforementioned ShapeTanimoto score, called the
TanimotoCombo score, which ranges from 0 (for the most dissimilar ligand) to 2 (for an
exact match). Freitas et al. (2010) discovered that the consideration of functional groups in
ligand-based shape-matching of CYP450 ligands helped to reduce false positives significantly.
In ligand-based shape-matching, the query structure plays a crucial role. Different queries
deliver different scores of matches, as discovered by Sykes and colleagues (2008) for CYP2C9
substrates (Table 3.1). This notion was confirmed by Freitas and colleagues (2010) for CYP2D6
substrates. These findings highlight the importance of choosing the right query for shape-matching,
as recommended by Kirchmair and colleagues (2009).
Figure 3.3. Example of a shape model of functional groups, generated by ROCS (adapted from
ROCS 3.0.0 Manual, 2009). “Donor” means H-bond donor group. “Acceptor” means H-bond
acceptor group. “Hydrophobe” means hydrophobic group.
Table 3.1. ROCS Combo Scores for CYP2C9 substrates with different queries (Sykes et al., 2008).
CYP2C9 substrate    ROCS Combo Score with Fluoxetine query    ROCS Combo Score with Flurbiprofen query
Amitriptyline       1.161                                     0.855
Carvedilol          1.251                                     0.911
Lansoprazole        1.218                                     1.028
Protein-based shape-matching has never been applied to screen CYP450 ligands. If it were
applied, however, it would face the challenge posed by the flexibility of CYP450s. For
example, CYP3A4 is known to be able to accommodate multiple ligands simultaneously (Ekroos et
al., 2006. Kapelyukh et al., 2008). Under such conditions, the conformation of CYP3A4 may give a
promiscuous query (a query that easily retrieves both true and false positives). Schneider and
colleagues (2008) suggested that protein-based shape-matching is successful only “if the binding
site of the target is small and buried”.
3.2. Pharmacophore-matching of CYP450 ligands
As mentioned at the beginning of this chapter, the complementarity between the amino acids of
a protein's active / binding site and the functional groups of its ligand can be exploited for
virtual screening. The functional groups can be a positively charged group, a negatively charged
group, an H-bond donor, an H-bond acceptor, an aromatic ring, or a hydrophobic group (Schneider et
al., 2008). Once these groups are recognized in a ligand, a framework model can be generated from
them that identifies their types and their positions relative to each other. This model is called a
pharmacophore. An example of a pharmacophore is depicted in Figure 3.4. By matching this
model with the functional groups of the compounds which are going to be screened, predictions can
be made about the compounds' affinities / activities. If they match, the compounds could have
affinities / activities similar to those of the ligand (Schneider et al., 2008).
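A pharmacophore, as described above, can be treated as a set of typed features with fixed distances between them. The following minimal Python sketch tests whether a candidate compound contains features of the same types at matching mutual distances. It is a brute-force illustration and not the algorithm of any particular pharmacophore program. The feature names, the distance tolerance of 1 Å, and all coordinates are hypothetical. Because only distances are compared, mirror-image arrangements are not distinguished.

```python
import math
from itertools import combinations, permutations
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    kind: str                         # e.g. "hydrophobe", "aromatic", "acceptor"
    xyz: Tuple[float, float, float]   # position in Å


def matches_pharmacophore(query: List[Feature], candidate: List[Feature],
                          tolerance: float = 1.0) -> bool:
    """True if some subset of candidate features reproduces the query's
    feature types and all inter-feature distances within `tolerance` Å."""
    n = len(query)
    for assignment in permutations(range(len(candidate)), n):
        mapped = [candidate[j] for j in assignment]
        if any(q.kind != m.kind for q, m in zip(query, mapped)):
            continue  # feature types do not correspond
        if all(abs(math.dist(query[i].xyz, query[k].xyz)
                   - math.dist(mapped[i].xyz, mapped[k].xyz)) <= tolerance
               for i, k in combinations(range(n), 2)):
            return True
    return False


# Invented three-point pharmacophore and a translated candidate with an extra feature.
pharmacophore = [Feature("hydrophobe", (0.0, 0.0, 0.0)),
                 Feature("aromatic", (4.0, 0.0, 0.0)),
                 Feature("acceptor", (0.0, 3.0, 0.0))]
candidate = [Feature("donor", (1.0, 1.0, 1.0)),
             Feature("acceptor", (10.0, 13.0, 10.0)),
             Feature("aromatic", (14.0, 10.0, 10.0)),
             Feature("hydrophobe", (10.0, 10.0, 10.0))]
print(matches_pharmacophore(pharmacophore, candidate))
```

A ligand with several binding modes would need one such query per binding mode, which illustrates why multiple binding modes complicate pharmacophore-matching.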
There are four ways to recognize functional groups of a ligand which are involved in the
protein-ligand binding, in order to generate a pharmacophore on them: (1) by a structure-
affinity/activity relationship study of the ligand; (2) by alignment with an active (preferably rigid);
(3) by a crystal structure of the ligand in complex with its protein; (4) by docking the ligand into its
protein's active / binding site, with supports from experimental data (e.g. site-directed mutagenesis
data) (Schneider et al., 2008. Locuson et al., 2005).
In the last 5 years, pharmacophore-matching of CYP450 ligands has been reviewed by de
Groot (2006) and reported by Schuster and colleagues (2006). Both de Groot (2006) and Schuster et
al. (2006) warned about the issues of multiple binding modes of a CYP450 ligand (as illustrated in
Figure 3.5) and concomitant occupation of one CYP450 active site by multiple ligands (Ekroos et
al., 2006. Kapelyukh et al., 2008), which eventually lead to more than one pharmacophore. For
these reasons, pharmacophore-matching might not be the best technique to reliably differentiate
between actives and inactives for the CYP450 family (Schuster et al., 2006).
Figure 3.4. Pharmacophore of a CYP1A2 inhibitor, generated from sulconazole (sticks) (Schuster et
al., 2006). The blue ball represents a hydrophobic group. The brown ball represents an aromatic ring. The green
balls represent H-bond acceptors. The grey surface represents the shape of sulconazole.
Figure 3.5. Possible multiple binding modes of a ligand in the CYP3A4 active site, marked by green and
purple lines (Mao et al., 2006). Here, the ligand is BFC (7-benzyloxy-4-(trifluoromethyl)-coumarin).
3.3. Field calculation of CYP450 ligands
While a pharmacophore refers to the binding functional groups of a ligand or of a protein's active
/ binding site, fields refer to the potential forces exerted by these groups (Schneider et al., 2008). By taking
fields into account, one could obtain a more realistic model than the shape model or the
pharmacophore. One prominent method for the calculation of fields is CoMFA (Comparative Molecular
Field Analysis) (Schneider et al., 2008). The method follows these steps (Schneider et al., 2008):
1. Ligands with known affinities / activities are aligned three-dimensionally to obtain their
common binding modes (Figure 3.6).
2. A virtual cubic lattice is generated around the ligands, which consists of discrete vertical and
horizontal layers (Figure 3.6). Then, a probe is placed at every intersection of the layers
(lattice point); the lattice points are usually 1-2 Å apart. The probe can be: a positively or
negatively charged atom (to represent an ionic force), an H-bond donor or acceptor, a
hydrophobic group, or an sp3 carbon atom (to represent a steric force). Each probe is
assumed to interact with the nearest atom(s) of the ligands, and the interaction force (field)
between them is calculated as an energy (unlike docking, in which the interaction is evaluated
between the ligand and its protein).
Figure 3.6. In CoMFA, a virtual cubic lattice is generated around ligands which have been aligned,
in order to provide positions for probes that will interact with the ligands. Interaction energies
between the probes and the ligands are calculated. The calculation results for all probes and ligands
are stored in columns. (Schneider et al., 2008)
3. All calculated energies are correlated with the affinity / activity of the ligands using a
linear correlation method (e.g. Partial Least Squares). The result is an equation of this form:
\[
\text{Affinity / activity} = c + \sum_{i=1}^{L} \sum_{j=1}^{P} a_{ij} E_{ij} \qquad \text{(Equation 3.2)}
\]
where $c$ is a constant; $L$ is the number of lattice points; $P$ is the number of probes; and the
coefficient $a_{ij}$ corresponds to placing probe $j$ at lattice point $i$, yielding the energy value $E_{ij}$.
The relative weights of the coefficients show which fields are favorable over the others, which is useful for drug design. The
quality of this equation is judged by the correlation between the calculated and experimental
affinities / activities, which is expressed as a squared correlation coefficient (R2). The
equation is considered good if its R2 approaches 1.00.
For a ligand whose affinity / activity is not known, the first two steps are applied. The resulting
energy values can then be used to predict its affinity / activity through Equation 3.2. However, one
should realize that the relationship between affinity / activity and the energies is assumed to be
linear, while it is not necessarily so (Schneider et al., 2008).
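The lattice, the probe energies, and the linear form of Equation 3.2 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the CoMFA implementation. The ligand is reduced to two hypothetical atoms with partial charges. The steric field uses a Lennard-Jones 12-6 term with illustrative parameters and the electrostatic field uses a Coulomb term, both summed over all ligand atoms and capped at an arbitrary 30 kcal/mol. The fitting of the coefficients (e.g. by Partial Least Squares) is left out.

```python
from itertools import product
from math import dist

def lattice(lo, hi, spacing=2.0):
    """Points of a cubic lattice spanning [lo, hi] on each axis (Å)."""
    n = int((hi - lo) / spacing) + 1
    axis = [lo + k * spacing for k in range(n)]
    return list(product(axis, axis, axis))

def steric(point, atoms, eps=0.1, sigma=3.4, cap=30.0):
    """Lennard-Jones 12-6 energy of a carbon-like probe with the ligand atoms.
    eps and sigma are illustrative values; the energy is capped near atoms."""
    e = 0.0
    for xyz, _charge in atoms:
        r = max(dist(point, xyz), 0.5)  # avoid division by zero on an atom
        e += 4 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)
    return min(e, cap)

def electrostatic(point, atoms, probe_charge=1.0, cap=30.0):
    """Coulomb energy (kcal/mol) of a charged probe with the ligand atoms."""
    e = sum(332.0 * probe_charge * q / max(dist(point, xyz), 0.5)
            for xyz, q in atoms)
    return max(-cap, min(e, cap))

def field_vector(atoms, points):
    """All E_ij values: for each lattice point i, one energy per probe j."""
    return [probe(p, atoms) for p in points for probe in (steric, electrostatic)]

def predict(c, coeffs, energies):
    """Equation 3.2: affinity / activity = c + sum of a_ij * E_ij."""
    return c + sum(a * e for a, e in zip(coeffs, energies))

# Hypothetical two-atom "ligand": (position in Å, partial charge).
atoms = [((0.0, 0.0, 0.0), -0.5), ((1.5, 0.0, 0.0), 0.3)]
points = lattice(-2.0, 2.0, spacing=2.0)   # 3 x 3 x 3 lattice points
energies = field_vector(atoms, points)     # 27 points x 2 probes

# With all coefficients at zero, the prediction reduces to the constant c.
print(len(points), len(energies), predict(1.0, [0.0] * len(energies), energies))
```

For a new ligand, only `field_vector` has to be recomputed on the same lattice; the fitted constant and coefficients are then reused in `predict`.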
Clearly, this technique depends on the actives (training set) that are used to generate
the fields. Peng and colleagues (2008) showed that for CYP2C9 inhibitors, two training sets with
different numbers of ligands and different Ki ranges gave different standard errors for Equation
3.2 (Table 3.2). Locuson and colleagues (2005) suggested that the inclusion of highly-active CYP450
ligands would improve Equation 3.2, because they represent strong interactions with CYP450.
Table 3.2. CoMFA statistics of CYP2C9 inhibitors in Peng et al. (2008)

Training set        Ki range (nM)    Number of ligands    Standard error
Complete            1.0 – 48,000     83                   0.822
Chromenones only    4.2 – 6,740      11                   0.243
CoMFA is also dependent on the alignment of the active ligands (Locuson et al., 2005).
Since there could be more than one mode of alignment, especially for dissimilar and flexible
ligands, the challenge is to get the correct alignment to obtain their common binding modes. In the
case of CYP450 ligands, there are two ways to resolve this problem.
The first is to choose reference atoms or groups in the ligands for the alignment, which
represent the general characteristics of the ligands. For example, for CYP2D6 substrates, most of which have a
protonated amine group, Haji-Momenian and colleagues (2003) used the amine as the reference for
their alignment. Additionally, they used the site of metabolism of each CYP2D6 substrate (based on
the same metabolism / reaction) as a reference. This method gave them a squared correlation
coefficient (R2) of 0.62 for their test dataset.
The second way is to dock the ligands into the CYP450 structure and to select the
docking poses that agree with site-directed mutagenesis and metabolism
data. These poses are then aligned at the heme (Figure 3.7). The resulting alignment is then extracted
from the active site for use in CoMFA. This approach was exemplified by Yasuo and colleagues
(2009) for the alignment of CYP2C9 inhibitors, and it gave them a squared correlation coefficient
(R2) of 0.941 for their dataset. Of course, when it comes to docking of CYP450 ligands, the
challenges and considerations mentioned in Chapter 2 (CYP450 flexibility, inclusion of
water molecules, etc.) apply.
Figure 3.7. An alternative way to align CYP450 ligands for CoMFA: by alignment of their docking
poses at heme (Yasuo et al., 2009)
Finally, there is also a method beyond CoMFA that generates fields without ligand alignment,
by using conformers of the ligands. As exemplified by Afzelius and colleagues (2004) with CYP2C9
ligands, 100 conformers were generated for each ligand, and all the conformers were then used
to generate fields, assuming that the true binding modes of the ligands are among those conformers
(Figure 3.8). This method gave them an R2 of 0.8 for their dataset. Gunther and colleagues (2006)
confirmed that this number of conformers per ligand is sufficient to cover the binding modes of a
ligand.
Figure 3.8. Generation of fields from conformers (adapted from Afzelius et al., 2004). (1) 3R,5S-
fluvastatin, a CYP2C9 inhibitor. (2) Fields generated from original structure. (3) Fields generated
from a conformer of the ligand. (4) Fields generated from 100 conformers of the ligand.
3.4. Summary
Shape-matching, pharmacophore-matching, and field calculation are virtual screening
techniques based on the “lock and key” model of protein-ligand binding. Each has its own
challenges and considerations.
Protein-based shape-matching is successful only if the active site of the target is small and
buried. Ligand-based shape-matching of CYP450 ligands requires careful choice of a query ligand.
Pharmacophore-matching might not be suitable to screen CYP450 ligands, due to possible
multiple binding modes of a CYP450 ligand, and possible concomitant occupation of one CYP450
active site by multiple ligands.
Field calculation can be improved by inclusion of highly-active CYP450 ligands in the
training set. Particularly in CoMFA (Comparative Molecular Field Analysis), correct alignment of
CYP450 ligands should be considered. While the relationship between affinity / activity and the
interaction energies in CoMFA is assumed to be linear, it is not necessarily so.
Chapter 4
QSAR and classification of CYP450 ligands
4.1. QSAR of CYP450 ligands
In the absence of a CYP450's structure, its ligands can be virtually screened by QSAR
(Quantitative Structure-Activity Relationship). The idea of QSAR is to correlate the affinities or
activities of the ligands with their molecular descriptors, usually through a linear equation.
Examples of such equations are presented below (Equations 4.1 – 4.3), which account for the
inhibitory activities of 21 flavonoids on CYP1A2 (Roy et al., 2008).
- log IC50 = 3.48 – 0.09 3Ka – 0.21 3Xc – 0.49 3XcV + 1.32 S_dsCH + 0.17 S_aaCH – 0.20 S_dssC –
0.14 S_aasC (Equation 4.1)
- log IC50 = - 0.56 + 0.19 S_aaCH + 0.99 S_dsCH + 1.69 JX (Equation 4.2)
- log IC50 = - 0.17 + 1.5 JX + 1.10 S_dsCH + 0.19 S_aaCH (Equation 4.3)
In these examples, the inhibitory activities of the flavonoids are correlated with their
descriptors: 3Ka (a kappa shape index), 3Xc and 3XcV (connectivity indices), and S_dssC and S_aasC
(electrotopological state parameters) (in Equation 4.1); S_dsCH and S_aaCH (electrotopological state
parameters) (in Equations 4.1 – 4.3); and JX (Balaban J topological parameter) (in Equations 4.2 and 4.3) (for details about
these descriptors, refer to Todeschini et al., 2000). These descriptors can be calculated for a
compound whose IC50 is not known, leading to its predicted IC50 through any of the above equations.
The three equations gave squared correlation coefficients (R2) of 0.745, 0.801, and 0.840,
respectively. These data suggest that different types of descriptors deliver different statistical
qualities of correlation. Apparently, more descriptors do not always give better quality, as
a comparison of Equation 4.1 with Equations 4.2 and 4.3 indicates. Therefore, the challenge in QSAR of CYP450 ligands is
to find the types and numbers of descriptors that deliver the highest quality of correlation.
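How such equations are used for prediction, and how their quality is judged, can be shown with a short Python sketch. The coefficients are those of Equations 4.2 and 4.3 as quoted above. The descriptor values passed in are hypothetical and were not calculated for any real flavonoid. The `r_squared` helper computes the squared Pearson correlation coefficient between two lists, which is assumed here to be the R2 that is reported.

```python
def eq_4_2(s_aach, s_dsch, jx):
    """-log IC50 according to Equation 4.2 (coefficients from Roy et al., 2008)."""
    return -0.56 + 0.19 * s_aach + 0.99 * s_dsch + 1.69 * jx

def eq_4_3(s_aach, s_dsch, jx):
    """-log IC50 according to Equation 4.3 (coefficients from Roy et al., 2008)."""
    return -0.17 + 1.5 * jx + 1.10 * s_dsch + 0.19 * s_aach

def r_squared(observed, predicted):
    """Squared Pearson correlation coefficient between two equally long lists."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)

# Hypothetical descriptor values for one compound (illustration only).
descriptors = {"s_aach": 10.0, "s_dsch": 1.0, "jx": 1.5}
print(eq_4_2(**descriptors))
print(eq_4_3(**descriptors))

# A perfectly linear relation between observed and predicted values gives R2 = 1.
print(r_squared([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The two equations use the same three descriptors with different coefficients, so they give slightly different predictions for the same compound.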
In the report of Roy and colleagues (2008) above, the descriptors were selected
automatically by different algorithms. The descriptors in Equation 4.1 were selected by the PLS (Partial Least
Squares) algorithm; the descriptors in Equation 4.2 were selected by the GFA (Genetic Function
Approximation) algorithm; and the descriptors in Equation 4.3 were selected by a combination of both
algorithms called G/PLS (Genetic Partial Least Squares) (for details about these algorithms, refer
to Roy et al., 2008). These algorithms proved to give high qualities of correlation in the case of
CYP1A2 ligands. However, they are not guaranteed to select physico-chemically meaningful
descriptors (Li et al., 2007. Li et al., 2008).
CYP450s and their substrates have some general characteristics which are summarized in
Table 1.4. These characteristics can be described by physico-chemically meaningful types of
descriptors such as: size, shape, electrostatic, H-bond donor, H-bond acceptor, and hydrophobic
descriptors. It is not surprising that several QSAR studies of CYP450 ligands in the last 5 years
eventually arrived at one of these types of descriptors (Table 4.1). With a better understanding of
CYP450s' active sites, the search for their ligands' descriptors could be directed to those that are
relevant to the active sites and physico-chemically meaningful.
Table 4.1. The uses of several types of descriptors in QSAR studies of CYP450 ligands in the last 5
years
Type of descriptors
CYP   Size   Shape   Electrostatic   H-bond donor   H-bond acceptor   Hydrophobic   Reference
1A2 - - - - √ - Appiah-Opong et al., 2008
- - - - √ Iori et al., 2005
- √ - - - - Roy et al., 2008
2C9 - - - - - √ Appiah-Opong et al., 2008
2D6 - - - - - √ Appiah-Opong et al., 2008
- - - - √ Ringsted et al., 2009
3A4 - - - - - √ Chuman, 2008
Besides the descriptors, the quality of a QSAR equation also depends on the training
dataset that is used to generate it. Li and colleagues (2008) recommended that the training dataset
should have sufficient diversity, which can be expressed by a diversity index (DI). An example of
a diversity index is given in Equation 4.4.
(Equation 4.4)
Here, div(A) is the diversity index of dataset A; diss(i,j) is the dissimilarity between ligands i and j in the
dataset; and N is the number of ligands in the dataset (Perez, 2005). Dissimilarity is simply defined
as (1 – similarity), and the similarity can be calculated by a similarity index (e.g. Tanimoto index).
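A diversity index of this kind can be sketched in Python as follows. The sketch assumes the mean pairwise dissimilarity over all ligand pairs as the index, which is one common form and may differ in detail from Equation 4.4. Fingerprints are represented as sets of "on" bit positions, and the example fingerprints are hypothetical.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def diversity_index(fingerprints):
    """Mean pairwise dissimilarity (1 - Tanimoto) over all ligand pairs.
    Assumed form of a diversity index; 0 = identical ligands, 1 = nothing shared."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

# Hypothetical fingerprints: two identical ligands and one unrelated ligand.
dataset = [{1, 2, 3}, {1, 2, 3}, {4, 5}]
print(diversity_index(dataset))
```

A training dataset made of close analogues gives an index near 0, which signals that a QSAR equation built on it may not extrapolate to other chemotypes.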
As mentioned above, a QSAR equation is usually linear. However, one should realize that
the structure-activity relationships of CYP450 ligands are not necessarily so. In the case of
CYP3A4, which has more than one ligand binding site (Ekroos et al., 2006. Kapelyukh et al.,
2008), its ligands would have more than one common binding mode. Hence, their affinities cannot
be represented by a single linear QSAR equation (Mao et al., 2006). In other words, their
structure-activity relationships are not linear. For non-linear structure-activity relationships,
machine learning techniques are more suitable; these are discussed next.
4.2. Classification of CYP450 ligands by machine learning
While descriptors are used to make an equation of correlation in QSAR, they can also be
used to classify ligands by machine learning. Unlike QSAR, classification is a qualitative prediction
of relative affinities / activities of ligands. Table 4.2 lists several machine learning techniques which
have been applied on CYP450 ligands in the last 5 years.
Table 4.2. Classification techniques applied on CYP450 ligands in the last 5 years
Classification technique               Applied on CYP              Reference
                                       1A2    2C9    2D6    3A4
Decision tree                          -      -      -      √      Choi et al., 2009
                                       √      -      -      -      Vasanthanathan et al., 2009
                                       √      -      √      -      Burton et al., 2006
                                       -      √      √      √      Hudelson et al., 2006
Hierarchical clustering                -      -      -      √      Meslamani et al., 2009
                                       √      √      √      √      Yamashita et al., 2008
k-nearest neighbour                    -      -      √      √      Jensen et al., 2007
Neural networks                        -      -      √      -      Bazeley et al., 2006
                                       √      -      -      -      Chohan et al., 2005
Principal Component Analysis (PCA)     -      √      -      √      Nath et al., 2008
                                       √      √      √      √      Fukunishi et al., 2006
Recursive partitioning                 √      -      √      -      Burton et al., 2009
                                       -      √      -      -      Hudelson et al., 2008
Self-Organizing Maps (SOM)             √      √      √      √      Veith et al., 2009
Support Vector Machine (SVM)           √      -      -      -      Vasanthanathan et al., 2009
                                       √      √      √      √      Michielan et al., 2009
                                       -      -      √      -      Eitrich et al., 2007
                                       -      √      √      √      Terfloth et al., 2007
                                       -      √      -      -      Koike, 2006
                                       -      -      -      √      Arimoto et al., 2005
                                       -      √      √      √      Yap et al., 2005
This chapter discusses only the two most frequently applied machine learning techniques in Table 4.2, namely
Support Vector Machine (SVM) and decision tree.
4.2.1. Classification of CYP450 ligands with Support Vector Machine (SVM)
Figure 4.1 illustrates how SVM works. Suppose we have a training set that contains some
CYP450 actives (green) and inactives (red). Plotting the two groups by their descriptors in two
dimensions shows that they are not linearly separable, which makes them difficult to correlate (in QSAR, some
members of these groups would be treated as outliers, which could be excluded to provide a better
correlation). Projecting these groups into another dimension by a function (called a kernel (κ))
positions the groups completely separated from each other, so that a hyperplane can be inserted
between them (Schneider et al., 2008). The best separating hyperplane is the one that is equally far
from both groups. The actives and inactives closest to the hyperplane are called support vectors
(from which the name of this classification technique is derived), and their equal distances to the
hyperplane are called margins. Each group then gets its attribute according to its position relative to
the hyperplane.
Figure 4.1. SVM works by projecting the two groups (green and red ones) into another dimension,
to afford a complete separation of them with a hyperplane (adapted from van Looy et al., 2007).
The hyperplane function can be found with some mathematical and computational effort.
Once this hyperplane is found, the projection can be used as a model to classify CYP450 ligands.
Compounds whose affinities / activities against CYP450 are not known can be subjected to this
projection to determine which attributes they get: whether the attributes are similar to those of the
actives or to those of the inactives. Based on the results of this projection, qualitative predictions can be made
for the affinities / activities of the compounds.
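The idea of projecting into another dimension can be made concrete with a small hand-built Python example. It is not a trained SVM. An explicit feature map stands in for the kernel, the data points are hypothetical, and the separating plane is placed by hand midway between the closest members of the two groups instead of being found by optimization.

```python
def project(point):
    """Map (a, b) to (a, b, a*a + b*b): an explicit stand-in for a kernel."""
    a, b = point
    return (a, b, a * a + b * b)

# Hypothetical two-descriptor data: actives cluster near the origin and are
# surrounded by inactives, so no straight line separates them in 2D.
actives = [(0.5, 0.0), (0.0, -0.6), (-0.4, 0.3), (0.2, 0.5)]
inactives = [(2.0, 0.0), (0.0, 2.2), (-1.8, -1.0), (1.5, -1.6)]

# After projection, the third coordinate alone separates the two groups.
z_actives = [project(p)[2] for p in actives]
z_inactives = [project(p)[2] for p in inactives]

# The closest members of each group to the other play the role of support vectors.
support_active = max(z_actives)
support_inactive = min(z_inactives)

# The plane z = threshold lies equally far from both support vectors.
threshold = (support_active + support_inactive) / 2
margin = (support_inactive - support_active) / 2

def classify(point):
    """Assign a compound to a group by its side of the separating plane."""
    return "active" if project(point)[2] < threshold else "inactive"

print(threshold, margin)
print(classify((0.3, 0.3)), classify((1.9, 0.5)))
```

A real SVM applies the kernel implicitly and determines the hyperplane by maximizing the margin over all training points, so the choice of kernel decides whether such a clean separation exists at all.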
Obviously, the discriminating power of SVM lies in the kernel that is implemented.
Eitrich and colleagues (2007) showed how different kernels delivered different Hit Rates (Table
4.3). This finding highlights the importance of choosing the best kernel.
Table 4.3. Effect of different kernels of SVM on the virtual screening of CYP2D6 ligands (adapted
from Eitrich et al., 2007). Each dataset contains 13 CYP2D6 inhibitors. Numbers in brackets are
numbers of descriptors assigned to the dataset. For details of the kernels, refer to Eitrich et al.
(2007).
Kernel      Hit Rate for Dataset
            1a (5)   1b (10)   1c (20)   1d (557)   1e (5)   1f (10)   1g (20)   1h (557)
Gaussian    0.85     0.77      0.62      0.77       0.69     0.69      0.38      0.69
Slater      0.85     0.62      0.46      0.92       0.69     0.54      0.38      0.62
In Figure 4.1, only 2 descriptors were used. In practice, SVM can accommodate hundreds of
descriptors at the same time, which makes its projection impossible to visualize. However, at some point,
increasing the number of descriptors no longer adds significantly to Specificity, as discovered by
Yap and colleagues (2005) in the SVM application on CYP3A4 substrates (Table 4.4). On the other
hand, the more descriptors are used, the more difficult it is to interpret their contributions to the
separation. Therefore, one should consider using as few physico-chemically meaningful descriptors
as possible in SVM.
Table 4.4. Effect of the number of descriptors on Sensitivity and Specificity in the SVM application on
CYP3A4 substrates (Yap et al., 2005). Numbers in brackets are standard deviations. The descriptors
were selected automatically by Genetic Algorithm (GA) from 1,497 descriptors available in
DRAGON Web 3.0 software. Beyond 400 descriptors, increasing the number of descriptors no longer
added significantly to Specificity.
In the above report, Yap and colleagues (2005) applied SVM on CYP2C9, CYP2D6, and
CYP3A4 substrates and inhibitors. They discovered that for the classification of these CYP450
substrates and inhibitors, shape and electrostatic types of descriptors were the most frequently selected,
leading to high Hit Rates (Table 4.5). This result suggests that shape and electrostatic types of
descriptors are relevant to the characteristics of CYP450s and their substrates (Table 1.4).
Table 4.5. Contributions of descriptors to Hit Rates in SVM applications on CYP2C9, CYP2D6,
and CYP3A4 ligands (adapted from Yap et al., 2005). S = substrates; nS = non-substrates; I =
inhibitors; nI = non-inhibitors. Numbers in brackets are standard deviations. Highest percentages of
descriptors are presented in blue.
CYP   Dataset     Percentage of selected types of descriptors                                      Hit Rate
                  Size   Shape   Electrostatic   H-bond donor   H-bond acceptor   Hydrophobic
2C9   S and nS    6.8    58.2    19.1            3.0            3.5               9.4              99.2 (0.9)
      I and nS    7.1    56.8    20.4            3.3            3.6               8.8              97.3 (1.3)
2D6   S and nS    6.3    59.7    18.9            3.5            3.1               8.5              96.9 (1.5)
      I and nS    7.5    57.1    20.5            2.5            2.4               8.8              96.7 (1.6)
3A4   S and nS    7.5    57.2    21.0            1.9            2.8               9.5              85.2 (3.0)
      I and nS    7.1    56.8    20.4            3.3            3.6               8.8              97.9 (1.5)
Like QSAR, an SVM model is dependent on the quality of the training dataset. Therefore,
the same considerations about training dataset in QSAR (diversity and uniformity of assay) apply
here as well.
It might be tempting to use SVM to address substrate selectivities between CYP450s.
However, one should realize that a compound can be metabolized by more than one CYP450
(Michielan et al., 2009), whereas SVM offers a complete distinction between substrates
of different CYP450s. Therefore, the application of SVM to address substrate selectivities between
CYP450s is not recommended.
4.2.2. Classification of CYP450 ligands with decision tree
A decision tree splits a dataset into “leaves” based on thresholds of descriptors in series
(Rose, 2003). The types and number of descriptors, their thresholds, the number of resulting leaves,
and the next leaves to split can be decided by a user or automatically by a program. The results
serve as qualitative predictions of the affinities / activities of the compounds in the dataset. Figure
4.2 presents examples of decision trees that were set up to separate CYP1A2 inhibitors from
CYP1A2 non-inhibitors.
Here, too, an understanding of a CYP450's characteristics helps to decide which and
how many descriptors to use, so that the tree can be built on as few physico-chemically
meaningful descriptors as possible. The threshold of each descriptor can be optimized with a
training dataset; therefore, a sufficiently diverse training dataset should be provided.
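The way a decision tree applies descriptor thresholds in series can be sketched in Python as below. The tree is hypothetical: the descriptor names, the thresholds, and the leaf labels were chosen for illustration only and do not reproduce the trees of Vasanthanathan et al. (2009) or Burton et al. (2006).

```python
# Hypothetical tree: descriptor names and thresholds are illustrative only.
TREE = {
    "descriptor": "logP", "threshold": 2.0,
    "below": {"leaf": "non-inhibitor"},
    "above": {
        "descriptor": "volume", "threshold": 350.0,
        "below": {"leaf": "inhibitor"},
        "above": {"leaf": "non-inhibitor"},
    },
}

def classify_with_tree(compound, node=TREE):
    """Walk from the root to a leaf, comparing one descriptor per split."""
    while "leaf" not in node:
        value = compound[node["descriptor"]]
        node = node["below"] if value < node["threshold"] else node["above"]
    return node["leaf"]

# Hypothetical compounds described by the two descriptors used in the tree.
print(classify_with_tree({"logP": 3.1, "volume": 280.0}))
print(classify_with_tree({"logP": 1.0, "volume": 280.0}))
print(classify_with_tree({"logP": 3.1, "volume": 400.0}))
```

Because every split names one descriptor and one threshold, the path to a leaf can be read as a rule, which is why a small number of physico-chemically meaningful descriptors keeps the tree interpretable.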
Figure 4.2. Decision trees for CYP1A2 ligands, from Vasanthanathan et al. (2009) (left) and Burton
et al. (2006) (right). On the left tree, the numbers are thresholds. On the right tree, SMR_VSA6,
SlogP_VSA7, SlogP_VSA9, and PEOE_VSA+4 are descriptors. Inhibitors are symbolized by “+”,
and non-inhibitors are symbolized by “-”. Numbers in brackets are true and false positives
respectively.
Unlike SVM, a decision tree could produce multiple classes from a dataset. Therefore, this
method is useful to address substrate selectivities between CYP450s qualitatively. More than 5
years ago, Lewis (2003) used a decision tree with 5 descriptors (volume, pKa, a/d2 (area/depth2),
ELUMO, and log P) to discriminate CYP450 substrates (Figure 4.3). The tree gave an overall
correlation of 94%. Of the five descriptors, four (volume, pKa, a/d2, and log P) confirm the
characteristics of CYP450s and their substrates in Table 1.4.
Figure 4.3. A decision tree to classify CYP450 substrates (Lewis, 2003)
4.3. Summary
In the absence of a CYP450's structure, its ligands can be virtually screened by QSAR
(Quantitative Structure-Activity Relationship) or machine learning. Both techniques utilize
molecular descriptors. In QSAR, the descriptors are correlated to affinity / activity; while in
machine learning, they are used to classify ligands. The challenge in QSAR and machine learning is
to find the types and numbers of descriptors that would deliver the best correlation or classification,
respectively. Physico-chemically meaningful types of descriptors like: shape, electrostatic, H-bond
acceptor, and hydrophobic descriptors have proven to be useful in QSAR or machine learning.
Support Vector Machine (SVM) is a machine learning technique that works by projecting two groups
in a dataset into another dimension to afford a complete separation of them. Because SVM offers
such a complete distinction between two groups, it is not suitable to address the selectivity of a
substrate that is metabolized by more than one CYP450. A decision tree can split a dataset of
CYP450 substrates into multiple classes, so it is more suitable to address the selectivity issue.
Chapter 5
Conclusions and Perspectives
The author has presented an overview of the applications of six techniques (docking,
shape-matching, pharmacophore-matching, field calculation, QSAR, and machine learning) for
virtual screening of CYP450 ligands in the last five years, with a focus on the challenges and
considerations in their application. Throughout this thesis, the challenges and considerations are
described as consequences of the chemical nature of the relevant CYP450 isoforms. The flexibility of a
CYP450 affects the docking poses of its ligands and also the success of protein-based shape-
matching; and the environment of its active site should be considered for the improvement of scoring
functions, and for the selection of descriptors in QSAR and machine learning. Special attention is given
to CYP3A4, since this CYP450 has the largest active site volume. The capability of CYP3A4 to
bind multiple ligands implies that a protein-based shape model (“negative image”) of this CYP450 is
too promiscuous for screening its ligands; and that there can be more than one
pharmacophore or QSAR model for its ligands (non-linear structure-activity relationships).
Training and test datasets are other issues to consider in virtual screening, since they
determine the quality of validation. As mentioned earlier, ligands in a training or test dataset
should be sufficiently diverse, and should have been tested in a uniform way (with the same assay
procedure, in the same laboratory).
The remaining question is: in what order should the virtual screening techniques be applied to
drug candidates? To answer this question, the author would like to suggest hierarchical virtual
screening (Schneider et al., 2008). The techniques can be applied in sequence, from machine
learning (a ligand-based technique) to docking (a protein-based technique). Machine learning is chosen
as the starting point because this technique does not involve a CYP450 structure, so its computational cost
is expected to be lower than that of protein-based techniques. A particular machine learning technique such as a
decision tree can classify the drug candidates into multiple classes of CYP450 ligands, while SVM
(Support Vector Machine) can deal with the non-linear structure-activity relationships that are
encountered for CYP3A4 ligands. When the number of remaining drug candidates is small, they can
be analyzed further by QSAR and protein-based techniques. Protein-based techniques are expected
to offer greater accuracy since they involve the CYP450 structure in generating their models. With this
hierarchical way of virtual screening, a balance between computational cost and accuracy can be
achieved.
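The suggested ordering can be sketched as a simple filter cascade in Python. The sketch assumes that each technique has already been reduced to a yes/no decision per candidate. The stage names follow the text, while the candidate records, the field names, and the cut-off values are hypothetical placeholders for real model outputs.

```python
def hierarchical_screen(candidates, stages):
    """Apply (name, keep) stages in order; each stage only sees the
    candidates that survived the previous, cheaper stage."""
    survivors = list(candidates)
    log = []
    for name, keep in stages:
        survivors = [c for c in survivors if keep(c)]
        log.append((name, len(survivors)))
    return survivors, log

# Hypothetical candidates with precomputed outputs of each technique.
candidates = [
    {"name": "A", "ml_class": "active", "qsar_pic50": 6.2, "dock_score": -9.1},
    {"name": "B", "ml_class": "inactive", "qsar_pic50": 7.0, "dock_score": -10.0},
    {"name": "C", "ml_class": "active", "qsar_pic50": 4.1, "dock_score": -8.5},
    {"name": "D", "ml_class": "active", "qsar_pic50": 5.8, "dock_score": -5.0},
]

# Cheapest technique first, most expensive last; cut-offs are illustrative.
stages = [
    ("machine learning", lambda c: c["ml_class"] == "active"),
    ("QSAR", lambda c: c["qsar_pic50"] >= 5.0),
    ("docking", lambda c: c["dock_score"] <= -7.0),
]

hits, log = hierarchical_screen(candidates, stages)
print([c["name"] for c in hits])
print(log)
```

In practice the expensive stages would be computed lazily, only for the survivors of the earlier stages, which is where the saving in computational cost comes from.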
Acknowledgements
This literature thesis is presented as part of the "Drug Discovery and Safety" master program at
the Department of Chemistry & Pharmaceutical Sciences, Faculty of Sciences – Vrije Universiteit,
the Netherlands.
The author expresses his gratitude to dr. Daan P. Geerke and Prof. dr. Nico P.E. Vermeulen
for their kind supervision. The Molecular Toxicology Division, the Department of Chemistry &
Pharmaceutical Sciences, and the Faculty of Sciences of Vrije Universiteit are thanked for all the
facilities that were used in making this thesis.
References
Afzelius, L.; Zamora, I.; Masimirembwa, C.M.; Karlen, A.; Andersson, T.B.; Mecucci, S.; Baroni,
M.; Cruciani, G. 2004. Conformer- and alignment-independent model for predicting
structurally diverse competitive CYP2C9 inhibitors. J. Med. Chem., 47, 907-914.
Appiah-Opong, R.; de Esch, I.; Commandeur, J.N.M.; Andarini, M.; and Vermeulen, N.P.E. 2008.
Structure-activity relationships for the inhibition of recombinant human Cytochrome P450
by curcumin analogues. European Journal of Medicinal Chemistry, 43, 1621-1631.
Arimoto, R. 2006. Computational models for predicting interactions with Cytochrome P450
enzyme. Curr. Top. Med. Chem., 6, 1609-1618.
Arimoto, R.; Prasad, M.-A.; and Gifford, E.M. 2005. Development of CYP3A4 inhibition models:
Comparisons of Machine-Learning techniques and molecular descriptors. Journal of
Biomolecular Screening, 10, 197-205.
Bazeley, P.S.; Prothivi, S.; Struble, C.A.; Povinelli, R.J.; and Sem, D.S. 2006. Synergistic use of
compound properties and docking scores in neural network modeling of CYP2D6 binding:
Predicted affinity and conformational sampling. J. Chem. Inf. Model., 46, 2698-2708.
Boelsterli, U.A. 2009. Mechanistic toxicology – The molecular basis of how chemicals disrupt
biological targets (2nd ed.). Informa Healthcare.
Burton, J.; Danloy, E., and Vercauteren, D.P. 2009. Fragment-based prediction of Cytochrome P450
2D6 and 1A2 inhibition by recursive partitioning. SAR and QSAR in Environmental
Research, 20(3), 185-205.
Burton, J.; Ijjaali, I.; Barberan, O.; Petitet, F.; Vercauteren, D.P.; and Michel, A. 2006. Recursive
partitioning for the prediction of Cytochrome P450 2D6 and 1A2 inhibition: Importance of
the Quality of the Dataset. J. Med. Chem., 49, 6231-6240.
Chohan, K.K.; Paine, S.W.; Mistry, J.; Barton, P.; and Davis, A.M. 2005. A rapid computational
filter for Cytochrome P450 1A2 inhibition potential of compound libraries. J. Med. Chem.,
48, 5154-5161.
Choi, I.; Kim, S.Y.; Kim, H.; Kang, N.S.; Bae, M.A.; Yoo, S.-E.; Jung, J.; and No, K.T. 2009.
Classification models for CYP450 3A4 inhibitors and non-inhibitors. European Journal of
Medicinal Chemistry, 44, 2354-2360.
Chuman, H. 2008. Toward basic understanding of the partition coefficient log P and its application
in QSAR. SAR and QSAR in Environmental Research, 19(1), 71-79.
de Groot, M.J. 2006. Designing better drugs: Predicting Cytochrome P450 metabolism. Drug
Discovery Today, 11(13), 601-606.
de Groot, M.J.; Lewis, D.F.V.; and Modi, S. Molecular modeling and Quantitative Structure–
Activity Relationship of substrates and inhibitors of drug metabolism enzymes. In: Taylor,
J.B. and Triggle, D.J. (Editors). 2006. Comprehensive medicinal chemistry II volume 5:
ADME-Tox approaches. Elsevier, Ltd.
Dickmann, L.J.; Locuson, C.W.; Jones, J.P.; and Rettie, A.E. 2004. Differential roles of Arg97,
Asp293, and Arg108 in enzyme stability and substrate specificity of CYP2C9. Mol.
Pharmacol., 65, 842.
Ebalunode, J.O. and Zheng, W. 2010. Molecular shape technologies in drug discovery: Methods
and applications. Curr. Top. Med. Chem., 10, 669-679.
Eitrich, T.; Kless, A.; Druska, C.; Meyer, W.; and Grotendorst, J. 2007. Classification of highly
unbalanced CYP450 data of drugs using cost sensitive Machine Learning techniques. J.
Chem. Inf. Model., 47, 92-103.
Ekroos, M. and Sjogren, T. 2006. Structural basis for ligand promiscuity in Cytochrome P450 3A4.
Proc. Natl. Acad. Sci., 103(37), 13682-13687.
Fukunishi, Y.; Hojo, S.; and Nakamura, H. 2006. An efficient in-silico screening method based on
the protein-compound affinity matrix and its application to the design of a focused library
for Cytochrome P450 (CYP) ligands. J. Chem. Inf. Model., 46, 2610-2622.
Freitas, R.F.; Bauab, R.L.; and Montanari, C.A. 2010. Novel application of 2D and 3D-similarity
searches to identify substrates among Cytochrome P450 2C9, 2D6, and 3A4. J. Chem. Inf.
Model., 50, 97-109.
Good, A. Virtual screening. In: Taylor, J.B. and Triggle, D.J. (Editors). 2006. Comprehensive
medicinal chemistry II volume 4: Computer-assisted drug design. Elsevier.
Gunther, S.; Senger, C.; Michalsky, E.; Goede, A.; and Preissner, R. 2006. Representation of target-
bound drugs by computed conformers: Implications for conformational libraries. BMC
Bioinformatics, 7(293).
Haji-Momenian, S.; Rieger, J.M.; Macdonald, T.L.; and Brown, M.L. 2003. Comparative molecular
field analysis and QSAR on substrates binding to Cytochrome P450 2D6. Bioorg. Med.
Chem., 11, 5545-5554.
Hritz, J.; de Ruiter, A.; and Oostenbrink, C. 2008. Impact of plasticity and flexibility on docking
results for Cytochrome P450 2D6: A combined approach of molecular dynamics and ligand
docking. J. Med. Chem., 51, 7469-7477.
Hudelson, M.G.; Ketkar, N.S.; Holder, L.B.; Carlson, T.J.; Peng, C.-C. 2008. High confidence
predictions of drug-drug interactions: Predicting affinities for Cytochrome P450 2C9 with
multiple computational methods. J. Med. Chem., 51, 648-654.
Hudelson, M.G. and Jones, J.P. 2006. Line-walking method for predicting the inhibition of P450
drug metabolism. J. Med. Chem., 49, 4367-4373.
Ingelman-Sundberg, M.; Oscarson, M.; McLellan, R.A. 1999. Polymorphic human Cytochrome
P450 enzymes: An opportunity for individualized drug treatment. Trends Pharmacol. Sci.,
20(8), 342-349.
Iori, F.; da Fonseca, R.; Ramos, M.J.; and Menziani, M.C. 2005. Theoretical Quantitative Structure
Activity Relationships of flavone ligands interacting with Cytochrome P450 1A1 and 1A2
isozymes. Bioorganic and Medicinal Chemistry, 13, 4366-4374.
Jensen, B.F.; Vind, C.; Padkjaer, S.B.; Brockhoff, P.B.; and Refsgaard, H.H.F. 2007. In-silico
prediction of Cytochrome P450 2D6 and 3A4 inhibition using Gaussian kernel weighted k-
nearest neighbor and extended connectivity fingerprints, including structural fragment
analysis of inhibitors versus noninhibitors. J. Med. Chem., 50, 501-511.
Kapelyukh, Y.; Paine, M.J.I.; Marechal, J.-D.; Sutcliffe, M.J.; Wolf, C.R.; and Roberts, G.C.K.
2008. Multiple substrate binding by Cytochrome P450 3A4: Estimation of the number of
bound substrate molecules. Drug Metabolism and Disposition, 36(10), 2136-2144.
Kirchmair, J.; Distinto, S.; Markt, P.; Schuster, D.; Spitzer, G.M.; Liedl, K.R.; and Wolber, G. 2009.
How to optimize shape-based virtual screening: Choosing the right query and including
chemical information. J. Chem. Inf. Model., 49(3), 678-692.
Kirchmair, J.; Markt, P.; Distinto, S.; Wolber, G.; and Langer, T. 2008. Evaluation of the
performance of 3D virtual screening protocols: RMSD comparisons, enrichment
assessments, and decoy selection — What can we learn from earlier mistakes? J. Comput.
Aided Mol. Des., 22, 213-228.
Kirton, S.B.; Murray, C.W.; Verdonk, M.L.; and Taylor, R.D. 2005. Prediction of binding modes for
ligands in the Cytochrome P450 and other heme-containing proteins. PROTEINS: Structure,
Function, and Bioinformatics, 58, 836-844.
Koike, A. 2006. Comparison of methods for chemical-compound affinity prediction. SAR and
QSAR in Environmental Research, 17(5), 497-514.
Korhonen, L.E.; Rahnasto, M.; Mahonen, N.J.; Wittekindt, C.; Poso, A.; Juvonen, R.O.; and Raunio,
H. 2005. Predictive three-dimensional Quantitative Structure-Activity Relationship of
Cytochrome P450 1A2 inhibitors. J. Med. Chem., 48, 3808-3815.
Laggner, C.; Wolber, G.; Kirchmair, J.; Schuster, D.; and Langer, T. Pharmacophore-based virtual
screening in drug discovery. In: Varnek, A. and Tropsha, A. (Editors). 2008.
Chemoinformatics approaches to virtual screening. Royal Society of Chemistry.
Leach, A.R. 2001. Molecular modeling: Principles and applications (2nd ed.). Prentice Hall.
Lewis, D.F.V. 2003. Quantitative Structure-Activity Relationships (QSARs) within the Cytochrome
P450 system: QSARs describing substrate binding, inhibition, and induction of P450s.
Inflammopharmacology, 11(1), 43-73.
Lewis, D.F.V. and Dickins, M. 2002. Substrate SARs in human P450s. Drug Discovery Today,
7(17), 918-925.
Li, H.; Sun, J.; Fan, X.; Sui, X.; Zhang, L.; Wang, Y.; and He, Z. 2008. Considerations and recent
advances in QSAR models for Cytochrome P450-mediated drug metabolism prediction. J.
Comput. Aided Mol. Des., 22, 843-855.
Li, H.; Yap, C.W.; Ung, C.Y.; Xue, Y.; Li, Z.R.; Han, L.Y.; Lin, H.H.; and Chen, Y.Z. 2007.
Machine-Learning approaches for predicting compounds that interact with therapeutic and
ADMET related proteins. Journal of Pharmaceutical Sciences, 96(11), 2838-2860.
Lin, J.H. and Lu, A.Y. 1998. Inhibition and induction of Cytochrome P450 and the clinical
implications. Clin. Pharmacokinet., 35, 361-390.
Locuson, C.W. and Wahlstrom, J.L. 2005. Three-dimensional Quantitative Structure-Activity
Relationship analysis of Cytochrome P450: Effect of incorporating higher-affinity ligands
and potential new applications. Drug Metabolism and Disposition, 33(7), 873-878.
Mao, B.; Gozalbes, R.; Barbosa, F.; Migeon, J.; Merrick, S.; Kamm, E.; Wong, E.; Costales, C.; Shi,
W.; Wu, C.; and Froloff, N. 2006. QSAR modeling of in-vitro inhibition of Cytochrome
P450 3A4. J. Chem. Inf. Model., 46, 2125-2134.
Meslamani, J.E.; Andre, F.; and Petitjean, M. 2009. Assessing the geometric diversity of
Cytochrome P450 ligand conformers by hierarchical clustering with a stop criterion. J.
Chem. Inf. Model., 49, 330-337.
Michielan, L.; Terfloth, L.; Gasteiger, J.; and Moro, S. 2009. Comparison of multilabel and single
label classification applied to the prediction of the isoform specificity of Cytochrome P450
substrates. J. Chem. Inf. Model., 49, 2588-2605.
Motiejunas, D. and Wade, R.C. Structural, energetic, and dynamic aspects of ligand-receptor
interactions. In: Taylor, J.B. and Triggle, D.J. (Editors). 2006. Comprehensive medicinal
chemistry II volume 4: Computer-assisted drug design. Elsevier.
Mpamhanga, C.P.; Chen, B.; McLay, I.M.; and Willett, P. 2006. Knowledge-based interaction fingerprint
scoring: A simple method for improving the effectiveness of fast scoring functions. J. Chem.
Inf. Model., 46, 686-698.
Nath, A. and Atkins, W. 2008. Principal Component Analysis of CYP2C9 and CYP3A4 probe
substrate/inhibitor panels. Drug Metabolism and Disposition, 36(11), 2151-2155.
Peng, C.C.; Rushmore, T.; Crouch, G.J.; and Jones, J.P. 2008. Modeling and synthesis of novel
tight-binding inhibitors of Cytochrome P450 2C9. Bioorg. Med. Chem., 16, 4064-4074.
Perez, J.J. 2005. Managing molecular diversity. Chemical Society Reviews, 34, 143-152.
Polgar, T.; Menyhard, D.K.; and Keseru, G.M. 2007. Effective virtual screening protocol for
CYP2C9 ligands using a screening site constructed from flurbiprofen and S-warfarin
pockets. J. Comput. Aided Mol. Des., 21, 539-548.
Putta, S. and Beroza, P. 2007. Shapes of things: Computer modeling of molecular shape in drug
discovery. Curr. Top. Med. Chem., 7, 1514-1524.
Ridderstrom, M.; Masimirembwa, C.; Trump-Kallmeyer, S.; Ahlefelt, M.; Otter, C.; and Andersson,
T.B. 2000. Arginines 97 and 108 in CYP2C9 are important determinants of the catalytic
function. Biochem. Biophys. Res. Commun., 270, 983.
Ringsted, T.; Nikolov, N.; Jensen, G.E.; Wedebye, E.B.; and Niemela, J. 2009. QSAR models for
P450 (2D6) substrate activity. SAR and QSAR in Environmental Research, 20(3), 309-325.
Rock, D.; Wahlstrom, J.; and Wienkers, L. Cytochrome P450s: Drug-drug interactions. In: Vaz,
R.J. and Klabunde, T. (Editors). 2008. Antitargets. WILEY-VCH Verlag GmbH & Co.
KGaA, Germany.
ROCS 3.0.0 Manual. http://www.eyesopen.com/docs/rocs/3.0.0/pdf/ROCS.pdf, accessed in May
2010.
Rose, J.R. Machine Learning techniques in chemistry. In: Gasteiger, J. (Ed.). 2003. Handbook of
Chemoinformatics. Wiley-VCH.
Rowland, P.; Blaney, F.E.; Smyth, M.G.; Jones, J.J.; Leydon, V.R.; Oxbrow, A.K.; Lewis, C.J.;
Tennant, M.G.; Modi, S.; Eggleston, D.S.; Chenery, R.J.; and Bridges, A.M. 2006. Crystal
structure of human cytochrome P450 2D6. J. Biol. Chem., 281, 7614-7622.
Roy, K. and Roy, P.P. 2008. Comparative QSAR studies of CYP1A2 inhibitor flavonoids using 2D
and 3D descriptors. Chem. Biol. Drug Des., 72, 370-382.
Sansen, S.; Yano, J.K.; Reynald, R.L.; Schoch, G.A.; Griffin, K.J.; Stout, C.D.; and Johnson, E.F.
2007. Adaptations for the oxidation of polycyclic aromatic hydrocarbons exhibited by the
structure of human P450 1A2. J. Biol. Chem., 282, 14348-14355.
Santos, R.; Hritz, J.; and Oostenbrink, C. 2010. Role of water in molecular docking simulations of
Cytochrome P450 2D6. J. Chem. Inf. Model., 50, 146-154.
Schneider, G. and Baringhaus, K.-H. 2008. Molecular design: Concepts and applications.
WILEY-VCH Verlag GmbH & Co. KGaA, Germany.
Stjernschantz, E.; Vermeulen, N.P.E.; and Oostenbrink, C. 2008. Computational prediction of drug
binding and rationalisation of selectivity towards Cytochrome P450. Expert Opin. Drug
Metab. Toxicol., 4(5), 513-527.
Schuster, D.; Laggner, C.; Steindl, T.M.; and Langer, T. 2006. Development and validation of an in
silico P450 profiler based on pharmacophore models. Current Drug Discovery
Technologies, 3, 1-48.
Sykes, M.J.; McKinnon, R.A.; and Miners, J.O. 2008. Prediction of metabolism by Cytochrome
P450 2C9: Alignment and docking studies of a validated database of substrates. J. Med.
Chem., 51, 780-791.
Todeschini, R. and Consonni, V. 2000. Handbook of molecular descriptors. Wiley-VCH.
Triballeau, N.; Bertrand, H.-O.; and Acher, F. Are You Sure You Have a Good Model? In: Langer, T.
and Hoffmann, R.D. (Editors). 2006. Pharmacophores and Pharmacophore Searches.
WILEY-VCH Verlag GmbH & Co. KGaA, Germany.
Van Looy, S.; Verplancke, T.; Benoit, D.; Hoste, E.; van Maele, G.; de Turck, F.; and Decruyenaere, J.
2007. A novel approach for prediction of tacrolimus blood concentration in liver
transplantation patients in the intensive care unit through support vector regression. Critical
Care, 11(R83).
Vasanthanathan, P.; Olsen, L.; Jorgensen, F.S.; Vermeulen, N.P.E.; and Oostenbrink, C. 2010.
Computational prediction of binding affinity for CYP1A2-ligand complexes using empirical
free energy calculations. Drug Metabolism and Disposition, 38(7) (E-pub ahead of print).
Vasanthanathan, P.; Taboureau, O.; Oostenbrink, C.; Vermeulen, N.P.E.; Olsen, L.; and Jorgensen,
F.S. 2009. Classification of Cytochrome P450 1A2 inhibitors and noninhibitors by Machine
Learning techniques. Drug Metabolism and Disposition, 37(3), 658-664.
Veith, H.; Southall, N.; Huang, R.; James, T.; Fayne, D.; Artemenko, N.; Shen, M.; Inglese, J.;
Austin, C.P.; Lloyd, D.G.; and Auld, D.S. 2009. Comprehensive characterization of Cytochrome
P450 isozyme selectivity across chemical libraries. Nature Biotechnology, 27(11), 1050-1057.
Verdonk, M.L.; Cole, J.C.; Hartshorn, M.J.; Murray, C.W.; and Taylor, R.D. 2003. Improved
protein-ligand docking using GOLD. PROTEINS: Structure, Function, and Genetics, 52,
609-623.
Vistoli, G. and Pedretti, A. Molecular fields to assess recognition forces and property spaces. In:
Taylor, J.B. and Triggle, D.J. (Editors). 2006. Comprehensive medicinal chemistry II
volume 5: ADME-Tox approaches. Elsevier.
Williams, J.A.; Hyland, R.; Jones, B.C.; Smith, D.A.; Hurst, S.; Goosen, T.C.; Peterkin, V.; Koup,
J.R.; and Ball, S.E. 2004. Drug-drug interactions for UDP-glucuronosyltransferase substrates: A
pharmacokinetic explanation for typically observed low-exposure (AUCI/AUC) ratios.
Drug Metabolism and Disposition, 32(11), 1201-1208.
Yasuo, K.; Yamaotsu, N.; Gouda, H.; Tsujishita, H.; and Hirono, S. 2009. Structure-based CoMFA as a
predictive model – CYP2C9 inhibitors as a test case. J. Chem. Inf. Model., 49, 853-864.
Yamashita, F.; Hara, H.; Ito, T.; and Hashida, M. 2008. Novel hierarchical classification and
visualization method for multiobjective optimization of drug properties: Application to
Structure-Activity Relationship analysis of Cytochrome P450 metabolism. J. Chem. Inf.
Model., 48, 364-369.
Yap, C.W. and Chen, Y.Z. 2005. Prediction of Cytochrome P450 3A4, 2D6, and 2C9 inhibitors and
substrates by using Support Vector Machine. J. Chem. Inf. Model., 45, 982-992.
Young, D.C. 2009. Computational drug design: A guide for computational and medicinal chemists.
John Wiley & Sons, Inc.
Zlokarnik, G.; Grootenhuis, P.D.; and Watson, J.B. 2005. High throughput P450 inhibition screens in
early drug discovery. Drug Discovery Today, 10(21), 1443-1450.