3D Qsar

3D-QSAR: 3-Dimensional Quantitative Structure-Activity Relationship
Unit-IV
Contents
1. Conceptual Introduction
2. Normal Procedure of CoMFA
3. More Considerations (CoMFA)
4. Beyond CoMFA (CoMFA/CoMSIA)
   Multiple Simultaneous Modeling: MFA
   More Explicit Receptor: RSA
   Lock & Key: 5D QSAR (Induced Fit)
   More Graphical Understanding, Variable Selection: GOLPE
   Pharmacophore Modeling: HypoGen
   Active Conformer/Superposition: Topomer CoMFA
5. 3D-QSAR
3D-QSAR Introduction
Field
CoMFA
PLS
Cross-validation
Experiments
What to do?
(3D QSAR)
Find 3D Information
Molecular Representation (example: benzene)
1D
2D
3D
4D: ensemble of conformations, orientations, protonation states
3D-QSAR Introduction
What is CoMFA?
CoMFA can help

CoMFA Procedure
Basic descriptor: Lennard-Jones 6-12 potential and Coulomb potential.
CoMFA is very dependent on the alignment rule.

CoMFA Procedure
Goal: Understanding the Receptor
From Dots (Molecular Fields) to Receptor Information
Comparison of Actives -> Pseudo-Receptor Model

CoMFA Goal
Binding affinity prediction:
Predict binding affinities for new (but similar) ligands based on a CoMFA model derived from the training set molecules.
Contour map analysis -> lead optimization:
Calculate contour surfaces based on the regression coefficients obtained from PLS analysis (e.g. colored SAS representation of the ligand).
Steric Properties
Electrostatic Properties
(3D) QSAR
Mostly: finding a good model for binding-constant-related properties using 3D information (descriptors).
Descriptors -> Properties
SA, Volume, Steric Field, Electrostatic Field, ... -> Ki, IC50, MIC
Other: Polar Surface Area -> ADME property

Field: a region affected by a physical force
ex) electrostatic, magnetic, gravitational, steric
Can be calculated easily from the structure.
ex) How big is the electrostatic field? For a point charge q at distance r (in units where Coulomb's constant is 1):
$$V = E\,r = \frac{q}{r}$$
CoMFA Comparative Molecular Field Analysis
CoMFA Assumptions
The modeled compound itself is the bioactive species (not a metabolite, etc.).
The proposed geometry is the bioactive conformation.
Binding involves mainly a single conformation.
The binding mode is the same for all the modeled compounds.
Binding is mainly enthalpic, not entropic.
The system is at equilibrium (kinetics are usually not considered).
Solvent effects, diffusion, transport (not always necessary)

Statistical Background
Classical Methods
One parameter: Linear Regression
Many parameters: Multilinear Regression
Multilinear Regression
MLR was developed to deal with situations in which the number of objects (N) is at least five times larger than the number of variables (X).
This inconvenience can be overcome by using stepwise MLR, but with a high probability of obtaining relationships just by chance.
MLR assumes that the X-variables are independent and uncorrelated.
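The chance-correlation risk noted above can be illustrated with a small sketch. It is not part of the original slides: the helper name `mlr_r2`, the random data, and the sizes (20 objects, 2 vs. 19 descriptors) are all assumptions chosen only to show that MLR fits pure noise almost perfectly once the number of variables approaches the number of objects.

```python
import numpy as np

def mlr_r2(X, y):
    """Fit y = b0 + X b by ordinary least squares and return the fitted r2."""
    A = np.column_stack([np.ones(len(y)), X])       # add an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n_objects = 20
y = rng.normal(size=n_objects)                      # random "activities"

# Descriptors are pure noise in both cases, so any fit is a chance correlation.
r2_few = mlr_r2(rng.normal(size=(n_objects, 2)), y)    # N is 10x the variables
r2_many = mlr_r2(rng.normal(size=(n_objects, 19)), y)  # N is close to the variables

print(f"2 random descriptors:  r2 = {r2_few:.3f}")
print(f"19 random descriptors: r2 = {r2_many:.3f}")
```

With 19 random descriptors plus an intercept there are as many parameters as objects, so the fitted r2 is essentially 1 even though the model has no predictive value.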

Statistical Background
PLS analysis:
Belongs to the family of PCR (principal component regression) techniques.
Use of principal component analysis in regression:
First, reduction of the X and/or Y matrices to principal components, also called latent variables (LVs).
Second, regression between these latent variables.
Different types:
PCR: reduction of the X matrix only, then regression with the Y variable.
PLS: reduction of the X matrix using the variability of the Y matrix.
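A minimal sketch of the PLS idea for a single activity column (PLS1, NIPALS-style) is given here. It is an illustration under stated assumptions, not the implementation used by any CoMFA package: the function name `pls1_fit`, the synthetic data, and the noise levels are hypothetical. Each latent variable is built from the direction in X that covaries most with y, which is the difference from PCR named above.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 via NIPALS. Returns (x_mean, y_mean, coef) with
    y_pred = y_mean + (X - x_mean) @ coef."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                    # weight: direction in X covarying with y
        w /= np.linalg.norm(w)
        t = Xk @ w                       # scores (latent variable)
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        c = yk @ t / tt                  # regression of y on the scores
        Xk = Xk - np.outer(t, p)         # deflate X
        yk = yk - c * t                  # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef

# Synthetic example: 15 "compounds", 200 correlated "field" columns, one hidden factor.
rng = np.random.default_rng(1)
n, m = 15, 200
latent = rng.normal(size=n)
X = np.outer(latent, rng.normal(size=m)) + 0.05 * rng.normal(size=(n, m))
y = 2.0 * latent + 0.05 * rng.normal(size=n)

x_mean, y_mean, coef = pls1_fit(X, y, n_components=1)
y_fit = y_mean + (X - x_mean) @ coef
r2 = 1.0 - np.sum((y - y_fit) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"fitted r2 with one latent variable: {r2:.3f}")
```

MLR could not be applied to this table at all (200 columns, 15 rows), whereas one latent variable is enough here because the columns are strongly correlated.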
Partial Least Squares Analysis (PLS)
Y vector (biological activities BA) and X matrix (steric S and electrostatic E field values, m columns each, for n compounds):

$$
Y = \begin{pmatrix} BA_1 \\ BA_2 \\ BA_3 \\ \vdots \\ BA_n \end{pmatrix},
\qquad
X = \begin{pmatrix}
S_{11} & S_{12} & \cdots & S_{1m} & E_{11} & E_{12} & \cdots & E_{1m} \\
S_{21} & S_{22} & \cdots & S_{2m} & E_{21} & E_{22} & \cdots & E_{2m} \\
S_{31} & S_{32} & \cdots & S_{3m} & E_{31} & E_{32} & \cdots & E_{3m} \\
\vdots & \vdots &        & \vdots & \vdots & \vdots &        & \vdots \\
S_{n1} & S_{n2} & \cdots & S_{nm} & E_{n1} & E_{n2} & \cdots & E_{nm}
\end{pmatrix}
$$

PLS analysis: yields the PLS vectors (latent variables) $u_1, t_1, t_2$, each with one entry per compound (compare PCA vectors; confidence volumes of PC).
SAMPLS analysis: works on the n x n covariance matrix $C = (c_{ij})$.
Cross-validated PLS analyses
[Figure: cross-validation flowchart. Original table; groups of cross-validation; compounds excluded; derivation of a model; prediction of excluded compounds; differences between predicted and measured activity; PRESS; q2]
LOO: leave-one-out, repeated n times
Cross-validation
Cross-validated r2 (q2; LOO method):

$$q^2 = 1.0 - \frac{PRESS}{SSD}, \qquad PRESS = \sum (y_{predicted} - y_{actual})^2, \qquad SSD = \sum (y_{actual} - \bar{y})^2$$

1.00 = Perfect prediction
0.50 (?) = Significant statistical results
(Use results only with care when q2 > 0.4)
0.00 = No model!
Negative values = prediction worse than those based on the mean over all compounds!
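The q2 definition above can be written out directly. The following is a minimal sketch that uses ordinary least squares as a stand-in model (a real CoMFA study would refit PLS in each round); the function names and the two small data sets are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients (intercept first)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def q2_leave_one_out(X, y):
    """q2 = 1 - PRESS / SSD, leaving out one compound at a time (n refits)."""
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef = fit_ols(X[keep], y[keep])            # model derived without compound i
        y_pred = predict_ols(coef, X[i:i + 1])[0]   # prediction of the excluded compound
        press += (y_pred - y[i]) ** 2
    ssd = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ssd

# An exactly linear relationship: every excluded compound is predicted perfectly.
X_line = np.arange(10.0).reshape(-1, 1)
q2_line = q2_leave_one_out(X_line, 2.0 * X_line[:, 0] + 1.0)

# Activities unrelated to the descriptor: q2 stays low and can be negative.
rng = np.random.default_rng(2)
q2_noise = q2_leave_one_out(rng.normal(size=(30, 1)), rng.normal(size=30))

print(f"q2 (linear data): {q2_line:.3f}")
print(f"q2 (random data): {q2_noise:.3f}")
```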
Which model is a better model?
(A) Straight line: simple fitting, Y = aX + b (e.g. weight/height)
(B) Curved line: perfect fitting, Y = aX^3 + bX^2 + cX + d (e.g. phosphoric acid charge/pH)
Cross-validation
Choice of the optimal number of components: principal source of overfitting in PLS analyses.
Graphs of q2 vs. number of components help the selection!
[Figure: two plots against the number of components (0 to 15): Q^2 and final r^2]
Principal rule: have more than 5 observations per component!
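One way to produce the q2 vs. number-of-components curve is sketched below, under assumptions: a compact PLS1 (NIPALS) routine stands in for the PLS code of a CoMFA package, the data are synthetic, and the names `pls1_coef` and `q2_by_components` are hypothetical. The upper limit on components follows the rule of more than 5 observations per component.

```python
import numpy as np

def pls1_coef(X, y, k):
    """Regression vector of a k-component PLS1 model for centered X and y."""
    Xk, yk = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(k):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w
        p = Xk.T @ t / (t @ t)
        c = yk @ t / (t @ t)
        Xk, yk = Xk - np.outer(t, p), yk - c * t    # deflation
        W.append(w); P.append(p); q.append(c)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(q))

def q2_by_components(X, y, max_k):
    """Leave-one-out q2 for models with 1..max_k components."""
    n = len(y)
    ssd = np.sum((y - y.mean()) ** 2)
    q2 = []
    for k in range(1, max_k + 1):
        press = 0.0
        for i in range(n):
            keep = np.arange(n) != i
            xm, ym = X[keep].mean(axis=0), y[keep].mean()
            b = pls1_coef(X[keep] - xm, y[keep] - ym, k)
            press += (ym + (X[i] - xm) @ b - y[i]) ** 2
        q2.append(1.0 - press / ssd)
    return q2

# Synthetic table: 20 compounds, 50 columns driven by two hidden factors.
rng = np.random.default_rng(3)
n = 20
T = rng.normal(size=(n, 2))
X = T @ rng.normal(size=(2, 50)) + 0.1 * rng.normal(size=(n, 50))
y = T[:, 0] - 0.5 * T[:, 1] + 0.1 * rng.normal(size=n)

max_k = (n - 1) // 5            # keeps n / k strictly above 5 (here: 3)
q2 = q2_by_components(X, y, max_k)
best_k = int(np.argmax(q2)) + 1
for k, value in enumerate(q2, start=1):
    print(f"{k} component(s): q2 = {value:.3f}")
print("selected number of components:", best_k)
```

Taking the maximum of q2 is the simplest selection rule; in practice one often prefers the smallest number of components after which q2 stops improving noticeably.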

Training Set Selection
A wide range of structurally diverse compounds
The range and the distribution of biological data
Congeneric series (binding to the same receptor with the same binding mode and the same active site)
Diverse vs. Congeneric
We usually don't know the real situation!

[Figure: chemical structures of seven congeneric sets. Set 1: 16 compounds; Set 2: 7 compounds; Set 3: 18 compounds; Set 4: 3 compounds; Set 5: 12 compounds; Set 6: 22 compounds; Set 7: 21 compounds]
[Figure: pIC50 (MAO A) plotted for each congeneric set]
Sets 4 and 7: not enough active (7) or inactive (5) compounds.
Sets 1, 2, 3 and 5: poor distribution of biological activities.
Set 6: broad range and relatively well distributed biological activities.
Statistical analyses for training set 6

Analysis  Field(s)  q2    N  r2    s     F     ste   ele   lip
A         S         .743  3  .894  .522  50.4  100   -     -
B         E         .433  1  .547  1.02  24.2  -     100   -
C         L         .598  3  .812  .694  25.8  -     -     100
D         S+E       .594  2  .790  .713  35.8  45.1  54.9  -
E         S+L       .673  3  .870  .576  40.3  40.6  -     59.4
F         E+L       .474  2  .682  .878  20.4  -     44.9  55.1
G         S+E+L     .570  2  .760  .763  30.1  22.8  35.1  42.1

N: number of components
Statistical analyses for combined training sets (1+4, 2+4, 3+4 and 1+2+3)
Q2 from SAMPLS using a Leave One Out (LOO) procedure

Field(s)  Sets 1+4   Sets 2+4   Sets 3+4   Sets 1+2+3
S         0.645 (1)  0.872 (1)  0.778 (2)  -0.035 (1)
E         0.786 (2)  0.831 (1)  0.840 (2)  0.198 (3)
L         0.637 (2)  0.792 (1)  0.735 (2)  -0.045 (1)
S+E       0.728 (1)  0.854 (1)  0.816 (2)  0.212 (4)
S+L       0.677 (1)  0.848 (1)  0.780 (2)  -0.034 (1)
E+L       0.856 (3)  0.828 (1)  0.787 (2)  0.211 (4)
S+E+L     0.771 (1)  0.846 (1)  0.797 (2)  0.023 (1)
By combining sets 1, 2 and 3, no CoMFA model was found, presumably due to the poor distribution of activities.
However, by combining set 4 with either set 1, 2 or 3, CoMFA produced
surprisingly good statistical models.
Training set selection
Statistics are markedly improved when set 4 (only 3 compounds of high activity!) is added. However, the activities separate into two clusters (poorly active and highly active compounds). It is thus trivial to find good linear models (a straight line through two points!).
The leave-one-out procedure was not able to detect this pitfall.
When the cross-validation was performed using groups of compounds, q2 varied from very good (> 0.8) to very bad (< -0.5, when all active compounds were removed!).
CONCLUSION
The choice of the training set is of prime importance.
The division into training and test sets is necessary for model validation.
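The two-cluster pitfall can be reproduced with a toy data set. This is a sketch under stated assumptions, not the MAO A data from the slides: 16 poorly active compounds whose activity does not depend on the descriptor, plus 3 highly active compounds far away in descriptor space. All values and the function name `q2_grouped` are hypothetical.

```python
import numpy as np

def q2_grouped(x, y, groups):
    """Cross-validated q2 for y = a*x + b, leaving out one group at a time."""
    press = 0.0
    for g in np.unique(groups):
        train = groups != g
        a, b = np.polyfit(x[train], y[train], 1)     # straight-line model
        press += np.sum((a * x[~train] + b - y[~train]) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Cluster 1: 16 poorly active compounds (activity about 5, unrelated to x).
# Cluster 2: 3 highly active compounds (activity about 9) at large x.
sign = np.where(np.arange(16) % 2 == 0, 1.0, -1.0)
x = np.concatenate([np.linspace(-1.5, 1.5, 16), [9.5, 10.0, 10.5]])
y = np.concatenate([5.0 + 0.2 * sign, [9.0, 9.2, 8.8]])

# Leave-one-out: every compound is its own group.
q2_loo = q2_grouped(x, y, np.arange(len(y)))

# Grouped cross-validation in which one group holds all three actives.
groups = np.concatenate([1 + np.arange(16) % 3, [0, 0, 0]])
q2_groups = q2_grouped(x, y, groups)

print(f"leave-one-out q2:        {q2_loo:.2f}")
print(f"q2 with actives grouped: {q2_groups:.2f}")
```

Leave-one-out always keeps two actives in the training data, so the line through the two clusters survives every round; once all actives are left out together, the remaining data carry no trend and q2 collapses.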
This is an example of correct alignment
in a receptor (ACE).
Molecular Field Calculation
Traditional CoMFA fields
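Steric fields, Lennard-Jones potential:

$$E_j = \sum_{k=1}^{n_{atoms}} \left[ \left( \frac{r_{probe} + r_k}{r_{jk}} \right)^{12} - 2 \left( \frac{r_{probe} + r_k}{r_{jk}} \right)^{6} \right]$$

Electrostatic fields, Coulomb potential:

$$E_j = \sum_{k=1}^{n_{atoms}} \frac{q_{probe}\, q_k}{r_{jk}}$$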
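The two field expressions can be evaluated on lattice points as in the following sketch. It follows the formulas exactly as they appear in these notes (no well-depth or unit-conversion factor, so the numbers are in arbitrary units); the function name, the probe radius of 1.7, the probe charge of +1, and the single truncation value `cap` are illustrative assumptions.

```python
import numpy as np

def comfa_fields(atom_xyz, atom_radii, atom_charges, grid_xyz,
                 r_probe=1.7, q_probe=1.0, cap=30.0):
    """Steric (Lennard-Jones 6-12) and electrostatic (Coulomb) field values
    at each lattice point j, summed over all atoms k."""
    # r[j, k]: distance between lattice point j and atom k
    r = np.linalg.norm(grid_xyz[:, None, :] - atom_xyz[None, :, :], axis=2)
    r = np.maximum(r, 1e-6)                       # avoid division by zero on an atom
    ratio = (r_probe + atom_radii)[None, :] / r
    steric = np.sum(ratio ** 12 - 2.0 * ratio ** 6, axis=1)
    elec = np.sum(q_probe * atom_charges[None, :] / r, axis=1)
    # Truncate the values that explode close to the nuclei.
    return np.minimum(steric, cap), np.clip(elec, -cap, cap)

# One hypothetical atom at the origin (radius 1.7, charge -0.5), two lattice points.
atoms = np.array([[0.0, 0.0, 0.0]])
radii = np.array([1.7])
charges = np.array([-0.5])
grid = np.array([[3.4, 0.0, 0.0],     # at r = r_probe + r_k, the potential minimum
                 [0.5, 0.0, 0.0]])    # inside the atom, steric value is truncated

steric, elec = comfa_fields(atoms, radii, charges, grid)
print("steric:", steric)
print("electrostatic:", elec)
```

In a full CoMFA table each molecule contributes one row, with the steric and electrostatic values of all lattice points as columns.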
Interpretation of CoMFA Results
[Figure: contour maps for compounds with low activity and compounds with high activity; labeled regions: steric (+/-), electrostatic (+/-), lipophilic, polar]
Model Validation
Within the Training Set
Model Fitting
R2 : the square of correlation coefficient
RMSE : root mean square error
ANOVA F value, p value
Internal Predictability
Cross-validated R2, Bootstrap R2 , y-Scrambling
With the Test Set

External Predictability
Outside the training/test set

Real Predictability (?)
Outliers
Three aspects
Leverage = how far the point lies from the rest of the data
Discrepancy = Out of line with others
Influence = combined leverage and discrepancy
[Figure: three example points]
Lo Leverage, Hi Discrepancy: Mod Influence
Hi Leverage, Hi Discrepancy: Hi Influence
Hi Leverage, Lo Discrepancy: Mod Influence
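The three aspects can be quantified with standard regression diagnostics. The sketch below is one common choice, not necessarily the one behind the slide: leverage as the diagonal of the hat matrix, discrepancy as the raw residual, and influence as Cook's distance. The function name and the toy data are hypothetical.

```python
import numpy as np

def outlier_diagnostics(X, y):
    """Return (leverage, residual, Cook's distance) for a least-squares fit."""
    A = np.column_stack([np.ones(len(y)), X])     # design matrix with intercept
    H = A @ np.linalg.pinv(A)                     # hat matrix: y_fit = H @ y
    leverage = np.diag(H)
    resid = y - H @ y                             # discrepancy
    p = A.shape[1]
    s2 = resid @ resid / (len(y) - p)             # residual variance
    cooks = resid ** 2 / (p * s2) * leverage / (1.0 - leverage) ** 2
    return leverage, resid, cooks

# Ten points on the line y = x plus one point far away in x and off the line.
x = np.append(np.arange(10.0), 30.0)
y = np.append(np.arange(10.0), 10.0)

leverage, resid, cooks = outlier_diagnostics(x.reshape(-1, 1), y)
print("highest leverage at index :", int(np.argmax(leverage)))
print("highest influence at index:", int(np.argmax(cooks)))
```

The last point combines high leverage with high discrepancy, so it dominates the fitted line; its raw residual is not the largest, which is why residuals alone can miss influential compounds.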
Residual Plot
Residual = observed Y - predicted Y
Pattern
(a) ideal
(b) failure of normality -> grouping or re-collection of data
(c) nonlinearity -> nonlinear fitting
(d) heteroscedasticity -> transformation
Detection of Outliers
Typical CoMFA Parameters
For the step of crossvalidation:

Cross-validated correlation coefficient, q2.
Optimal number of components, N.
For the final model:

Squared correlation coefficient, r2.
Standard error of estimate, s.
Residuals.
F values.
Recommended Statistical Values for QSAR
Statistical parameter             Statistical value
R2                                > 0.8
R2CV or Q2 (cross-validation R2)  > 0.5
Descriptors or components         > 5 compounds per descriptor or component
Data points                       At least 10, optimally above 20
Data range                        At least two orders of magnitude
Podlogar, B.; Muegge, I.; Brice, J. Curr. Opin. Drug Discovery Dev., 2001, 4(1), 102-109.
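The recommended values can be turned into a simple checklist. This is a sketch only: the function name and argument names are hypothetical, and the example uses the statistics of analysis A for training set 6 (q2 = .743, N = 3, r2 = .894, 22 compounds) together with an assumed activity range of 3 log units, which is not stated in these notes.

```python
def check_qsar_model(r2, q2, n_compounds, n_components, log_activity_range):
    """Compare model statistics with the recommended values listed above.
    log_activity_range: spread of the activity data in log10 units."""
    return {
        "R2 > 0.8": r2 > 0.8,
        "Q2 > 0.5": q2 > 0.5,
        "> 5 compounds per component": n_compounds / n_components > 5,
        "at least 10 data points": n_compounds >= 10,
        "at least two orders of magnitude": log_activity_range >= 2.0,
    }

# Analysis A of training set 6; the activity range of 3.0 is an assumption.
checks = check_qsar_model(r2=0.894, q2=0.743, n_compounds=22,
                          n_components=3, log_activity_range=3.0)
for name, passed in checks.items():
    print(f"{name:35s} {'ok' if passed else 'FAIL'}")
```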
Summary: Factors Affecting CoMFA
Statistics
Training/Test Set Selection

Outlier Removal?
Variable Selection
Validation Methods
Others
Field Calculation Method

Summary: Factors Affecting CoMFA
Geometry
Bioactive Conformation?
Geometry Optimization Method?
Superposition between Actives?
Orientation against Lattice?
Lattice Size?
Probe Atom Charge and Size?
3^5 = 243
3. More Considerations
Cutoff, H-bonding: CoMSIA
Conformer Generation & Superposition, Field Calculation, Orientation Against Lattice (# of points in Lattice)
Variable Selection Procedure (Regional Focusing, GOLPE)
Hard Potential (CoMFA) vs. Soft Potential (CoMSIA)
Cutoff: 5 kcal/mol for steric, 30 kcal/mol for Coulomb

Improvement of CoMSIA over CoMFA
Soft Potential (Less affected by Lattice Point)
Contiguous Map (includes atomic points)
Entropic Contribution (Hydrophobic)
Less Conformation, Distance Dependency
No Arbitrary Cutoff and Scaling is needed.
H-bonding (important in chemistry)

Superposition Methods
(Manual / GA) with (Atom / Pharmacophore)
MCS detection (MCS: Maximal Common Subgraph)
Field Fit (dissimilar compounds, non-atom-centered)
Molecular Skin (skin volume can be changed; works well with significantly different sizes)
Regional Overlap (weighting the contributions of one molecular region with this overlap; the resulting superposition can be close to experiments)
Field Calculation Methods
Additional fields in other software

Interaction energies (GRID/GOLPE)
hydrophobic field (HINT)
Molecular Lipophilicity Field (CLIP)
Grid-distributed target property (HASL)
etc. etc
Charge Calculation
Semiempirical/ab initio calculation followed by population analysis
Electrostatic-potential-fitted charges
Other empirical charges
Variable Selection Methods
Why Select Variables? ( Selection of Grid Points )
CoMFA is a very underdetermined method.

Less is better.
Minimizes the risk of chance effects
More predictive model
better interpretability
Essentially Subjective Procedure

GOLPE (Generating Optimal Linear PLS Estimations)
Guided Region Selection
Polyhedra
Coalescing
Conclusion of CoMFA
Many Choices
Training/Test set
Conformer Generation
Superposition
Field Calculation
Variable Selection
Statistical Validation
Test several combinations (all are related).
What is the best combination of choices?
Ultimately by experimental verification; practically by statistical parameters.
Beyond CoMFA
Advantages and Disadvantages of CoMFA & CoMSIA
Compared with 2D QSAR:
Good 3D Information
Can test Dissimilar Compounds
Problems and Some Improvements
Many models at once -> GFA
Not really interacting with a certain receptor -> RSM
Static (Lock & Key) -> 5D QSAR
Difficult to handle flexible molecules -> HypoGen
Difficult for virtual screening -> Topomer CoMFA
GFA (Genetic Function Approximation: Cerius)
Combination of GA (Genetic Algorithm) and MARS (Multivariate Adaptive Regression Splines)
GA for multiple solutions
(GA is a population optimization method.)
Can test many models simultaneously
MARS for nonlinear regression
Various basis functions (splines, Gaussians, polynomials)
Linear regression is also possible.
Outputs of the multiple models can be averaged to gain
additional predictability.
Splines can be interpreted as range identification or
outlier removal.
Interpretations are often very difficult when nonlinear.

LOF (Lack of Fit) by Friedman

$$LOF = \frac{LSE}{\left( 1 - \dfrac{c + d\,p}{M} \right)^{2}}$$

LSE: least-squares error
c: number of basis functions
d: smoothing parameter (user-defined)
p: number of features
M: number of training molecules
Prevents over-fitting (minimum)
Fitness function for GA
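The effect of the LOF penalty can be seen in a short numerical sketch. The function name and the two example models (their LSE values, sizes, and the choice d = 1) are hypothetical; only the formula itself comes from the notes above.

```python
def lack_of_fit(lse, c, d, p, m):
    """Friedman's lack-of-fit: LSE / (1 - (c + d*p) / M)^2."""
    penalty = 1.0 - (c + d * p) / m
    return lse / penalty ** 2

# Two hypothetical models for M = 20 training molecules, smoothing parameter d = 1.
small_model = lack_of_fit(lse=0.50, c=3, d=1.0, p=4, m=20)   # fewer terms, larger LSE
large_model = lack_of_fit(lse=0.30, c=6, d=1.0, p=8, m=20)   # more terms, smaller LSE

print(f"LOF small model: {small_model:.3f}")
print(f"LOF large model: {large_model:.3f}")
```

The larger model fits the training data better (lower LSE) but receives the worse LOF score, which is how the measure discourages over-fitting when used as the GA fitness function.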
MFA (Molecular Field Analysis: Cerius)
Very Similar to CoMFA
Data Evaluation by GFA
Rectangular, Spherical, Random Grid

Receptor Surface Model ( Pseudo-Receptor )
A model that characterizes the putative active site surface
Alignment of the most active ligands in their bioactive conformations
Marching cubes isosurface algorithm (triangulated surface points: 6 points / Å^2)
(A) Van der Waals field function:
$$V(r) = r - R, \qquad V(r) = 0$$
(B) Wyvill field function:
$$V(r) = -\frac{4 r^6}{9 R^6} + \frac{17 r^4}{9 R^4} - \frac{22 r^2}{9 R^2} + 1$$
[Figure: V(VDW) and V(Wyvill) plotted against r, with axis marks at 0, 1R and 2R]
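The Wyvill polynomial has convenient values at a few distances, which the following sketch checks numerically. The function names are hypothetical, and clamping the function to zero beyond r = R is an assumption of this sketch (the polynomial itself reaches zero exactly at r = R).

```python
import numpy as np

def vdw_field(r, R):
    """Van der Waals field function: V(r) = r - R."""
    return np.asarray(r, dtype=float) - R

def wyvill_field(r, R):
    """Wyvill field function, clamped to the interval 0 <= r <= R."""
    x = np.clip(np.asarray(r, dtype=float) / R, 0.0, 1.0)
    return 1.0 - (22.0 * x ** 2 - 17.0 * x ** 4 + 4.0 * x ** 6) / 9.0

R = 2.0
r = np.array([0.0, 0.5 * R, R, 2.0 * R])
print("r       :", r)
print("VDW     :", vdw_field(r, R))
print("Wyvill  :", wyvill_field(r, R))
```

The Wyvill function falls smoothly from 1 at the atom center through 0.5 at r = R/2 to 0 at r = R, whereas the van der Waals function is linear in r and crosses zero at r = R.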
Receptor Surface Model ( Features & Usage )
Surface Properties
Partial charges, electrostatic potential, hydrogen bonding
propensity, hydrophobicity
Model Usage
Open / Closed model
Structures can be energy minimized within the receptor
surface model (Alignment, Docking)
3D-QSAR,
Virtual Screening (Catalyst/CatShape)
de novo design
Same Molecules, Different models, Different Predictions
< CoMFA > < RSM >

Fitting to Average Ligand Fitting to Average Receptor
4D QSAR Introduction
Quasar (Quasi-atomistic receptor modeling)
Receptor site represented by a 3D envelope (steric nature), built from many superimposed active structures
Properties assigned at each surface point (r = 0.8 angstrom): hydrophobicity, partial charge, electrostatic potential, hydrogen bond propensity
4th dimension: ensemble of conformations, orientations, and protonation states of ligands
ex) 3D + time by Albert Einstein
< local induced fit and H-bond flip flop >

4D QSAR Induced Fit Calculation
The number of property points is equal, but the assigned properties are adapted for each molecule.
Induced fit is calculated by positional constraints (0.1 kcal/mol Å^2).
4D QSAR Examples
Different ligand adaptation
Points on the Envelope

4D QSAR Procedure
[Flowchart; box labels: Aligned Congeneric Actives; Primordial Envelope (steric nature); Initial Family of Points; Evolution of the Points (GA); Ligand repositioning; Averaged Receptor Envelope; Analysis of Model Family; Estimation of Free Energy of Ligand Binding]
Fitness Function: LOF (RMS of Gpred vs Gexp)

5D QSAR Adding One More Dimension
4D QSAR plus
5th dimension: the adaptation of the receptor binding pocket to the individual ligand topology
- six different induced-fit protocols for adapting the mean envelopes
- reduces bias associated with the choice of bioactive conformation, alignment, and induced-fit model
- resulting geometry may be more absurd (over-induced fit)
Coping with Conformational Explosion:
Until now, conformational explosion was not an issue.
In cases where there are many possible conformations, this becomes very annoying.
Consider all the conformations and alignments as far as possible: HypoGen
Ignore all the conformations and alignments by selecting only one conformation and one alignment: Topomer CoMFA
HypoGen Conformer Generation
To consider conformationally complex molecules, conformational sampling is necessary (up to 250 conformers).
Poling Algorithm (using a potential for conformational diversity)
If a new conformer is similar to an existing conformer, it is penalized by a potential.
Ligands are represented by pharmacophore features (labels H, D, A in the figure).
Reduction of the number of points:
CoMFA: lattice points, over 1000 points
HypoGen: pharmacophore points, below 10 points
HypoGen Definitions
Pharmacophore: a set of features and their 3D orientation
Features: functional groups of the same kind
Most Active Group: (MA × UncMA) - (A / UncA) > 0.0
Least Active Group: log(A) - log(MA) > 3.5
[Figure: pharmacophore feature labels H, D, A]
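The two group definitions can be applied to a list of activities as in this sketch. Several points are assumptions rather than statements from the notes: A is taken as an activity where smaller means more potent (for example IC50 in nM), MA is the smallest value in the set, Unc is read as an activity uncertainty factor, a single value of 3 is used for every compound, and the logarithm is base 10. The function name is hypothetical.

```python
import math

def classify_activity_groups(activities, uncertainty=3.0):
    """Label each compound 'most active', 'least active' or 'intermediate'
    using the two inequalities from the definitions above."""
    ma = min(activities)                     # MA: the most active compound
    labels = []
    for a in activities:
        if ma * uncertainty - a / uncertainty > 0.0:
            labels.append("most active")
        elif math.log10(a) - math.log10(ma) > 3.5:
            labels.append("least active")
        else:
            labels.append("intermediate")
    return labels

# Hypothetical activities (smaller = more potent).
activities = [1.0, 5.0, 20.0, 400.0, 5000.0]
labels = classify_activity_groups(activities)
for a, label in zip(activities, labels):
    print(f"{a:8.1f}  {label}")
```

With these assumptions, compounds within roughly a factor of nine of the best compound count as most active, and compounds more than 3.5 log units weaker count as least active.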
HypoGen Theory
[Flowchart; box labels: Pharmacophore Domain; Constructive Phase; Subtractive Phase; Feasible Models; Optimization Phase; Scoring; Top Models]
HypoGen Constructive Phase
1. Training Set
2. Identify the most active compounds
3. Enumerate all possible pharmacophore configurations
4. Check for duplicates
5. Ensure that the rest of the most active compounds fit to MinSubsetPoint features (reduces the # of hypotheses)
[Figure labels: most active compounds; second most active]
HypoGen Subtractive Phase
1. Training Set
2. Identify the least active compounds
3. Enumerate all possible pharmacophore configurations
4. Check for configurations shared with the most active compounds
5. Eliminate if shared by more than half of the least actives
[Figure labels: least active compounds; next least active]
HypoGen Optimization Phase
(From Feasible Pharmacophores)
1. Features and/or locations are varied to optimize activity prediction via a simulated annealing (SA) approach
2. Geometric fits are calculated
3. Linear regression of log(activity) vs. geometric fit
4. Total cost is calculated for each new hypothesis (the best hypothesis has the lowest cost)
Cost terms: the fitness error (against activity data) and the weight error (against feature weights)
HypoGen pros & cons
Can handle conformationally complex molecules (up to 250 conformations, tetrapeptide)
Can be used as an input geometry for methods such as CoMFA (it is also one of the alignment methods)
Activity information is utilized for conformer selection and superposition
Provides reasonable selection criteria among many hypotheses
SAR rather than QSAR (no electrostatic or steric information is considered)
Topomer CoMFA alignment
[Figure: combi-chem library fragment attached to a fixed core, atoms numbered 1 to 15]
Adjust torsions: (1,3), (1,2), (5,8), (10,14)
Direction: pointing away from the FIT atoms
Calculate steric field like CoMFA
Topomer CoMFA Features
Single Conformation
Automatic Alignment & Superposition
Binned Field Values, reducing the # of fields
Steric Fields: 0, 2, 4, 6, 8, ..., 30 kcal/mol
Electrostatic Fields: -13, -11, ..., 11, 13 kcal/mol
Attenuation Factor (0.85)
[Figure: fragment on a fixed core with attenuation weights 1, 0.85 and 0.85^2]
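The binning and attenuation steps can be sketched as follows. The bin levels and the factor 0.85 come from the notes above; everything else is an assumption made for illustration: that the weight is 0.85 raised to a per-atom step count away from the fixed core (the figure shows only 1, 0.85 and 0.85^2), that field values are snapped to the nearest bin level, and the function names themselves.

```python
import numpy as np

STERIC_LEVELS = np.arange(0.0, 32.0, 2.0)     # 0, 2, 4, ..., 30 kcal/mol
ELEC_LEVELS = np.arange(-13.0, 15.0, 2.0)     # -13, -11, ..., 11, 13 kcal/mol

def attenuation(n_steps, factor=0.85):
    """Weight of an atom's field contribution, n_steps away from the fixed core."""
    return factor ** n_steps

def bin_field(values, levels):
    """Snap each field value to the nearest allowed level."""
    values = np.asarray(values, dtype=float)
    idx = np.abs(values[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

weights = [attenuation(n) for n in range(3)]
steric_binned = bin_field([0.9, 7.2, 45.0], STERIC_LEVELS)
elec_binned = bin_field([-20.0, 0.4], ELEC_LEVELS)

print("attenuation weights:", weights)
print("binned steric values:", steric_binned)
print("binned electrostatic values:", elec_binned)
```

Binning reduces the number of distinct field values per lattice point, and the attenuation lowers the weight of atoms far from the core, whose positions are the least certain in a single-conformation model.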
Topomer CoMFA pros & cons
Fast
Virtual Screening Conscious
Lead Optimization
Lead Hopping (Lead Generation)
Is this really true? (high q2)

Requires common moiety (for fixed core)
More false positives
SAR rather than QSAR?
Conclusion:
3D QSAR methods:
Very robust
Lead optimization
For Virtual Screening
3D QSAR 3D SAR
