
3D-QSAR

3 - Dimensional
Quantitative Structure Activity Relationship

Unit-IV
Contents
1. Conceptual Introduction

2. Normal Procedure of CoMFA

3. More Considerations
(CoMFA)

4. Beyond CoMFA
(CoMFA/CoMFA)
Multiple Simultaneous Modeling -> MFA
More Explicit Receptor -> RSA
Lock & Key -> 5D QSAR (Induced Fit)
More Graphical Understanding, Variable Selection -> GOLPE
Pharmacophore Modeling -> HypoGen
Active Conformer/Superposition -> Topomer CoMFA

5. 3D-QSAR
3D-QSAR Introduction

Field
CoMFA
PLS
Cross-validation
Experiments

What to do?
(3D QSAR)

Find 3D Information
Molecular Representation
1D: benzene
2D
3D
4D: ensemble of conformations, orientations, protonation states
3D-QSAR Introduction
What is CoMFA?

CoMFA can help


CoMFA Procedure
Basic descriptor: Lennard-Jones 6-12 potential and Coulomb potential.

CoMFA is very dependent on the alignment rule.


CoMFA Procedure
Goal: Understanding the Receptor
From Dots (Molecular Fields) to Receptor Information

Comparison of Actives -> Pseudo-Receptor Model


CoMFA Goal
Binding affinity prediction:
Predict binding affinities for new (but similar) ligands from a CoMFA
model derived from the training set molecules.

Contour map analysis -> lead optimization:
Calculate contour surfaces from the regression coefficients obtained
from PLS analysis (e.g. colored SAS representation of the ligand).

Steric Properties

Electrostatic Properties
(3D) QSAR

Mostly: finding a good model for binding-constant-related
properties using 3D information (descriptors).

3D descriptors: SA, volume, steric field, electrostatic field, ...
Modeled properties: Ki, IC50, MIC

Other: polar surface area -> ADME property


Field: a region affected by a physical force

ex) electrostatic, magnetic, gravitational, steric

Can be calculated easily from the structure.

ex) How big is the electrostatic field?

$$V = E\,r = \frac{q}{r}$$
CoMFA Comparative Molecular Field Analysis
CoMFA Assumptions

The modeled compound itself is the bioactive species (not a metabolite, etc.).

The proposed geometry is the bioactive conformation.

Binding involves mainly a single conformation.

The binding mode is the same for all the modeled compounds.

Binding is mainly enthalpic, not entropic.

The system is at equilibrium (kinetics are usually not considered).

Solvent effects, diffusion and transport are not considered (not always necessary).


Statistical Background

Classical Methods
One parameter : Linear Regression
Many Parameters : Multilinear Regression

Multilinear Regression

MLR was developed for situations in which the number of objects (N) is at least
five times larger than the number of variables (X).

This limitation can be worked around by using stepwise MLR, but then there is a
high probability of obtaining relationships just by chance.

MLR assumes that the X-variables are "independent" and uncorrelated.


Statistical Background

PLS analysis:
Belongs to the family of PCR (principal component
regressions) techniques.
Use of principal component analysis in regression:
First reduction of X and/or Y matrices in principal
components also called latent variables (LVs).
Secondly, regression between these latent variables.
Different types:
PCR : reduction of X matrix only and regression with Y
variable.
PLS : reduction of X matrix using the variability of Y
matrix.
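
The difference between PCR and PLS described above can be made concrete with a small sketch. The following is a minimal PLS1 (single Y column) implementation with NIPALS-style deflation in NumPy. The function name `pls1_fit` and the toy data are illustrative assumptions and are not taken from any CoMFA package.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Sketch of PLS1: latent variables of X are chosen using the variability of y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean          # centered copies, deflated step by step
    W, P, Q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                        # weight vector: covariance of X with y
        w /= np.linalg.norm(w)
        t = Xk @ w                           # scores (latent variable)
        tt = t @ t
        p = Xk.T @ t / tt                    # X loadings
        q = yk @ t / tt                      # regression of y on the latent variable
        Xk = Xk - np.outer(t, p)             # remove the explained part of X
        yk = yk - q * t                      # remove the explained part of y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    coef = W @ np.linalg.solve(P.T @ W, Q)   # coefficients in the original X space
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Toy data (made up): 20 "compounds", 3 "field" variables, exactly linear activity.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + 0.5
coef_demo, intercept_demo = pls1_fit(X_demo, y_demo, n_components=3)
```

With as many components as X has independent columns, the sketch reproduces the ordinary least-squares solution; with fewer components it gives the reduced model that PLS is used for when there are far more grid points than compounds.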
Partial Least Squares Analysis (PLS)

Y vector    X matrix
BA1    S11 S12 S13 S14 ... S1m    E11 E12 ... E1m
BA2    S21 S22 S23 S24 ... S2m    E21 E22 ... E2m
BA3    S31 S32 S33 S34 ... S3m    E31 E32 ... E3m
:      :                          :
BAn    Sn1 Sn2 Sn3 Sn4 ... Snm    En1 En2 ... Enm

(n compounds; BA: biological activity, S: steric field values, E: electrostatic field values at m lattice points)

PLS analysis -> PLS vectors (latent variables): U11 ... U1n, t11 ... t1n, t21 ... t2n
SAMPLS analysis -> covariance matrix (n x n): c11 ... cnn

[Figure labels: PLS vectors, PCA vectors, confidence volumes of PC]
Cross-validated PLS analyses

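[Figure: cross-validation scheme. The original table is split into groups of crossvalidation; a model is derived with some compounds excluded; the activities of the excluded compounds are predicted; the differences between predicted and measured activities give PRESS.]

LOO: leave-one-out, repeated n times -> q2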


Cross-validation

Crossvalidated r2 (q2: LOO method)

$$q^2 = 1.0 - \frac{\mathrm{PRESS}}{\mathrm{SSD}}$$

$$\mathrm{PRESS} = \sum (y_{\mathrm{predicted}} - y_{\mathrm{actual}})^2$$

$$\mathrm{SSD} = \sum (y_{\mathrm{actual}} - \bar{y})^2$$

Scale of q2:
1.00 = Perfect prediction
0.50 (?) = Significant statistical results
(Use results only with care when q2 > 0.4)
0.00 = No model!
Negative values = prediction worse than those based on the mean over all compounds!
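
The q2 definition (1 - PRESS/SSD) can be sketched in a few lines. This is a minimal illustration that uses ordinary least squares as a stand-in for the PLS model; the function name `q2_loo` and the toy numbers are assumptions made for the example.

```python
import numpy as np

def q2_loo(X, y):
    """Leave-one-out q2 = 1 - PRESS/SSD, with a least-squares model as a stand-in for PLS."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])      # intercept column plus descriptors
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i              # exclude compound i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (A[i] @ coef - y[i]) ** 2    # squared error on the excluded compound
    ssd = np.sum((y - y.mean()) ** 2)         # spread around the mean activity
    return 1.0 - press / ssd

# Made-up examples: a perfectly linear activity and an activity unrelated to x.
x_demo = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
q2_perfect = q2_loo(x_demo, [1.0, 3.0, 5.0, 7.0, 9.0, 11.0])
q2_noise = q2_loo(x_demo, [1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
```

The first case gives q2 of essentially 1 (perfect prediction); the second gives a negative q2, that is, predictions worse than simply using the mean over all compounds.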
Which model is a better model?

(A) Straight line: simple fitting
Y = aX + b
ex) weight/height

(B) Curved line: perfect fitting
Y = aX^3 + bX^2 + cX + d
ex) phosphoric acid charge/pH
Cross-validation

Choice of the optimal number of components: the principal source of
overfitting in PLS analyses.
Graphs of q2 vs. number of components help the selection!

[Figure: two plots against the number of components (0 to 15): Q^2 and final r^2.]

Principal rule: have more than 5 observations per component!


Training Set Selection

A wide range of structurally diverse compounds

The range and the distribution of biological data

Congeneric series (binding to the same receptor with the
same binding mode, at the same active site)
Diverse vs. congeneric

We usually don't know the real situation!


Training Set Selection

[Figure: core structures of the seven congeneric sets.]

Set 1: 16 compounds
Set 2: 7 compounds
Set 3: 18 compounds
Set 4: 3 compounds
Set 5: 12 compounds
Set 6: 22 compounds
Set 7: 21 compounds
Training Set Selection

[Figure: pIC50 (MAO A) values, on a scale of about 3 to 10, plotted for each congeneric set.]

Sets 4 and 7: not enough active (7) or inactive (5) compounds.
Sets 1, 2, 3 and 5: poor distribution of biological activities.
Set 6: broad range and relatively well distributed biological activities.
Training Set Selection

Statistical analyses for training set 6

Analysis  Field(s)  q2     N  r2     s      F     ste    ele    lip
A         S         0.743  3  0.894  0.522  50.4  100    -      -
B         E         0.433  1  0.547  1.02   24.2  -      100    -
C         L         0.598  3  0.812  0.694  25.8  -      -      100
D         S+E       0.594  2  0.790  0.713  35.8  45.1   54.9   -
E         S+L       0.673  3  0.870  0.576  40.3  40.6   -      59.4
F         E+L       0.474  2  0.682  0.878  20.4  -      44.9   55.1
G         S+E+L     0.570  2  0.760  0.763  30.1  22.8   35.1   42.1

N: number of components
ste, ele, lip: relative contributions (%) of the steric, electrostatic and lipophilic fields
Training Set Selection

Statistical analyses for training sets 1, 3, 5 and 6

Q2 from SAMPLS using a Leave One Out (LOO) procedure


Training Set Selection

Field(s)  Sets 1+4   Sets 2+4   Sets 3+4   Sets 1+2+3
S         0.645 (1)  0.872 (1)  0.778 (2)  -0.035 (1)
E         0.786 (2)  0.831 (1)  0.840 (2)   0.198 (3)
L         0.637 (2)  0.792 (1)  0.735 (2)  -0.045 (1)
S+E       0.728 (1)  0.854 (1)  0.816 (2)   0.212 (4)
S+L       0.677 (1)  0.848 (1)  0.780 (2)  -0.034 (1)
E+L       0.856 (3)  0.828 (1)  0.787 (2)   0.211 (4)
S+E+L     0.771 (1)  0.846 (1)  0.797 (2)   0.023 (1)

Values: q2 (number of components)

By combining sets 1, 2 and 3, no CoMFA model was found, presumably
due to the poor distribution of activities.
However, by combining set 4 with either set 1, 2 or 3, CoMFA produced
surprisingly good statistical models.
Training set selection

Statistics are markedly improved when set 4 (only 3 compounds of
high activity!) is added. However, the activities fall into two
clusters (poorly active and highly active compounds), so it is
trivial to find a good linear model (a straight line through two points!).
Training set selection

The leave-one-out procedure was not able to detect this pitfall.
When crossvalidation was performed using groups of compounds,
q2 varied from very good (> 0.8) to very bad (< -0.5, when all active
compounds were removed!).
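
The two-cluster pitfall can be reproduced with a small sketch. The descriptor and activity values below are made up for illustration (they are not the MAO A data), and a one-descriptor straight-line fit stands in for the CoMFA/PLS model; `q2_grouped` is a hypothetical helper name.

```python
import numpy as np

def q2_grouped(x, y, groups):
    """q2 where each listed group of compound indices is left out in turn."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    press = 0.0
    for left_out in groups:
        keep = np.ones(len(y), dtype=bool)
        keep[left_out] = False                                # exclude this group
        slope, intercept = np.polyfit(x[keep], y[keep], 1)    # straight-line model
        press += np.sum((slope * x[~keep] + intercept - y[~keep]) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Hypothetical data: four poorly active and three highly active compounds.
x = [0.0, 0.2, 0.4, 0.6, 10.0, 10.2, 10.4]   # made-up descriptor values
y = [5.0, 5.2, 4.9, 5.1, 9.0, 9.1, 8.9]      # made-up pIC50 values

q2_leave_one_out = q2_grouped(x, y, [[i] for i in range(len(y))])
q2_leave_cluster_out = q2_grouped(x, y, [[0, 1, 2, 3], [4, 5, 6]])
```

Leave-one-out always keeps members of both clusters in the model and reports a q2 close to 1, while leaving out a whole cluster gives a strongly negative q2, which mirrors the behaviour described on the slide.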
Training set selection

CONCLUSION

The choice of the training set is of prime importance.

The division into training and test sets is necessary for model
validation.
This is an example of correct alignment
in a receptor (ACE).
Molecular Field Calculation

Traditional CoMFA fields

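Steric fields, Lennard-Jones potential

$$E_j = \sum_{k=1}^{n_{\mathrm{atoms}}} \varepsilon \left[ \left( \frac{r_{\mathrm{probe}} + r_k}{r_{jk}} \right)^{12} - 2 \left( \frac{r_{\mathrm{probe}} + r_k}{r_{jk}} \right)^{6} \right]$$

Electrostatic fields, Coulomb potential

$$E_j = \sum_{k=1}^{n_{\mathrm{atoms}}} \frac{q_{\mathrm{probe}}\, q_k}{r_{jk}}$$

(j: lattice point; k: atom of the molecule; r_jk: distance between lattice point j and atom k; r_probe, r_k: radii of the probe and of atom k; q_probe, q_k: their charges; \varepsilon: well depth)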
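
A minimal NumPy sketch of the two field calculations on a lattice follows. The function name `comfa_fields`, the probe radius, the well depth `eps`, the Coulomb unit-conversion constant and the cutoff defaults are illustrative assumptions, not the parameters of any particular CoMFA implementation.

```python
import numpy as np

COULOMB_K = 332.06  # kcal*A/(mol*e^2); unit-conversion constant (assumption)

def comfa_fields(coords, radii, charges, grid, r_probe=1.7, q_probe=1.0,
                 eps=0.1, steric_cut=5.0, elec_cut=30.0):
    """Steric (Lennard-Jones 12-6) and electrostatic (Coulomb) energies at each lattice point."""
    coords = np.asarray(coords, dtype=float)    # (natoms, 3) atom positions
    radii = np.asarray(radii, dtype=float)      # (natoms,) atomic radii
    charges = np.asarray(charges, dtype=float)  # (natoms,) partial charges
    grid = np.asarray(grid, dtype=float)        # (npoints, 3) lattice points
    # r[j, k] = distance between lattice point j and atom k
    r = np.linalg.norm(grid[:, None, :] - coords[None, :, :], axis=2)
    r = np.maximum(r, 1e-6)                     # avoid division by zero on an atom
    ratio = (r_probe + radii)[None, :] / r
    steric = np.sum(eps * (ratio ** 12 - 2.0 * ratio ** 6), axis=1)
    elec = np.sum(COULOMB_K * q_probe * charges[None, :] / r, axis=1)
    # Truncate large values so that points inside the molecule do not dominate.
    return np.minimum(steric, steric_cut), np.clip(elec, -elec_cut, elec_cut)

# One hypothetical atom at the origin and two lattice points.
atom_xyz = [[0.0, 0.0, 0.0]]
grid_xyz = [[3.4, 0.0, 0.0], [1.0, 0.0, 0.0]]
steric, elec = comfa_fields(atom_xyz, [1.7], [0.1], grid_xyz)
```

At the first lattice point the probe sits at the sum of the radii, so the steric energy equals the negative well depth; the second point lies inside the atom and both fields are truncated at their cutoffs.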
Interpretation of CoMFA Results

[Figure: CoMFA contour maps compared for compounds with low activity and compounds with high activity; regions are labeled steric (+/-), electrostatic (+/-), lipophilic and polar.]
Model Validation
Within the Training Set
Model Fitting
R2 : the square of correlation coefficient
RMSE : root mean square error
ANOVA F value, p value

Internal Predictability
Cross-validated R2, Bootstrap R2 , y-Scrambling

With the Test Set


External Predictability

Outside the training/test set


Real Predictability (?)
Outliers

Three aspects
Leverage = How far from Reds
Discrepancy = Out of line with others
Influence = combined leverage and discrepancy

[Figure: three example outliers]
Lo leverage, hi discrepancy: moderate influence
Hi leverage, hi discrepancy: high influence
Hi leverage, lo discrepancy: moderate influence
Residual Plot
Residual = observed Y - predicted Y

Patterns:
(a) ideal

(b) failure of normality -> grouping or recollection

(c) nonlinearity -> nonlinear fitting

(d) heteroscedasticity -> transformation

Detection of Outliers
Typical CoMFA Parameters

For the step of crossvalidation:


Cross-validated correlation coefficient, q2.
Optimal number of components, N.

For the final model:


Squared correlation coefficient, r2.
Standard error of estimate, s.
Residuals.
F values.
Recommended Statistical Values for QSAR

Statistical parameter             Statistical value
R2                                > 0.8
R2CV or Q2 (cross-validated R2)   > 0.5
Descriptors                       > 5 compounds per descriptor or component
Data points                       At least 10, optimally above 20
Data range                        At least two orders of magnitude

Podlogar, B.; Muegge, I. and Brice, J. Curr. Opin. Drug Discovery Dev., 2001, 4(1), 102-109.
Summary: Factors Affecting CoMFA

Statistics

Training/Test Set Selection


Outlier Removal?
Variable Selection
Validation Methods

Others

Field Calculation Method


Summary: Factors Affecting CoMFA
Geometry

Bioactive Conformation?
Geometry Optimization Method?
Superposition between Actives?
Orientation against Lattice?
Lattice Size?
Probe Atom Charge and Size?

(3^5 = 243)
3. More Considerations

Cutoff, H-bonding -> CoMSIA

Conformer Generation & Superposition, Field Calculation,
Orientation Against Lattice (# of points in Lattice)
Variable Selection Procedure (Regional Focusing, GOLPE)
Hard Potential (CoMFA) vs. Soft Potential (CoMSIA)

Cutoff: 5 kcal/mol for steric, 30 kcal/mol for Coulomb


Improvement of CoMSIA over CoMFA

Soft Potential (Less affected by Lattice Point)

Contiguous Map (includes atomic points)

Entropic Contribution (Hydrophobic)

Less Conformation, Distance Dependency

No Arbitrary Cutoff and Scaling is needed.

H-bonding (important in chemistry)


Superposition Methods

( Manual / GA ) with ( Atom / Pharmacophore )


Superposition Methods

MCS detection

MCS: Maximal Common Subgraph


Superposition Methods

Field Fit
(dissimilar compounds)
(non atom center)
Superposition Methods

Molecular Skin

(the skin volume can be changed;
works well with molecules of significantly different size)
Superposition Methods

Regional Overlap

weighting contributions of
one molecular region with
this overlap

resulting superposition can be close to experiments
Field Calculation Methods

Additional fields in other software


Interaction energies (GRID/GOLPE)
hydrophobic field (HINT)
Molecular Lipophilicity Field (CLIP)
Grid-distributed target property (HASL)
etc. etc
Charge Calculation

Semiempirical/ab initio calculation followed by population analysis
Electrostatic potential fitted charges
Other empirical charges
Variable Selection Methods

Why Select Variables? ( Selection of Grid Points )

CoMFA is a very underdetermined method.


Less is better.
Minimizes the risk of chance effects
More predictive model
better interpretability

Essentially Subjective Procedure


GOLPE (Generating Optimal Linear PLS Estimations)
Guided Region Selection

Polyhedra
Coalescing
Conclusion of CoMFA

Many Choices
Training/Test set
Conformer Generation
Superposition
Field Calculation
Variable Selection
Statistical Validation

Test several combinations (all are related).

What is the best combination of choices?

Ultimately: by experimental verification
Practically: by statistical parameters
Beyond CoMFA
Advantages and Disadvantages of CoMFA & CoMSIA

Compared with 2D QSAR:

Good 3D Information
Can test Dissimilar Compounds

Problems and Some Improvements

Many models at once -> GFA
Not Really Interacting with a Certain Receptor -> RSM
Static (Lock & Key) -> 5D QSAR
Difficult to Handle Flexible Molecules -> HypoGen
Difficult for Virtual Screening -> Topomer CoMFA
GFA (Genetic Function Approximation: Cerius)

Combination of GA (Genetic Algorithm)
and MARS (Multivariate Adaptive Regression Splines)
GA for multiple solutions
(GA is a population optimization method.)
Can test many models simultaneously
MARS for nonlinear regression
Various basis functions (splines, Gaussians, polynomials)
Linear regression is also possible.
Outputs of the multiple models can be averaged to gain
additional predictability.
Splines can be interpreted as range identification or
outlier removal.

Interpretations are often very difficult when nonlinear.


LOF (Lack of Fit) by Friedman

$$\mathrm{LOF} = \frac{\mathrm{LSE}}{\left( 1 - \dfrac{c + d\,p}{M} \right)^{2}}$$

LSE : least-squares error
c : number of basis functions
d : smoothing parameter (user defined)
p : number of features
M : number of training molecules
Prevents over-fitting (minimum)
Fitness function for GA
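
A short sketch of the LOF score as a function. The name `lack_of_fit` and the example numbers are assumptions for illustration; the formula is the one given on this slide, LOF = LSE / (1 - (c + d*p)/M)^2.

```python
def lack_of_fit(lse, c, d, p, m):
    """Friedman's lack-of-fit: least-squares error penalized by model size."""
    denom = 1.0 - (c + d * p) / m
    if denom <= 0.0:
        raise ValueError("model too large for the number of training molecules")
    return lse / denom ** 2

# Same least-squares error, but the second model uses more basis functions and features.
lof_small = lack_of_fit(lse=1.0, c=2, d=1.0, p=3, m=10)   # denominator (1 - 0.5)^2
lof_large = lack_of_fit(lse=1.0, c=4, d=1.0, p=4, m=10)   # denominator (1 - 0.8)^2
```

For equal LSE the larger model receives the higher (worse) LOF, which is how the score discourages over-fitting when it is used as the GA fitness function.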
MFA (Molecular Field Analysis: Cerius)

Very Similar to CoMFA

Data Evaluation by GFA

Rectangular, Spherical, Random Grid


Receptor Surface Model ( Pseudo-Receptor )

A model that characterizes the putative active site surface

Alignment of most active ligands bioactive conformation

Marching cubes isosurface algorithm


(triangulated surface points: 6 points / Å^2)

(A) Van der Waals field function

$$V(r) = r - R$$

$$V(r) = 0$$

(B) Wyvill field function

$$V(r) = -\frac{4 r^6}{9 R^6} + \frac{17 r^4}{9 R^4} - \frac{22 r^2}{9 R^2} + 1$$

[Figure: V(VDW) and V(Wyvill) plotted against r, from 0 to 2R.]
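
The two field functions can be evaluated directly, as in the sketch below. The function names are illustrative, and setting the Wyvill function to zero for r >= R is an assumption made here (the polynomial itself reaches zero at r = R).

```python
import numpy as np

def vdw_field(r, R):
    """Van der Waals field function: signed distance from the sphere of radius R."""
    return np.asarray(r, dtype=float) - R

def wyvill_field(r, R):
    """Wyvill field function; assumed to be zero at and beyond r = R."""
    r = np.asarray(r, dtype=float)
    s2 = (r / R) ** 2
    v = 1.0 - (4.0 / 9.0) * s2 ** 3 + (17.0 / 9.0) * s2 ** 2 - (22.0 / 9.0) * s2
    return np.where(r < R, v, 0.0)

r_values = np.array([0.0, 0.5, 1.0, 2.0])
v_wyvill = wyvill_field(r_values, R=1.0)
v_vdw = vdw_field(r_values, R=1.0)
```

The Wyvill function falls smoothly from 1 at the atom centre to 0.5 at r = R/2 and to 0 at r = R, whereas the van der Waals function changes sign exactly at r = R.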
Receptor Surface Model ( Features & Usage )

Surface Properties
Partial charges, electrostatic potential, hydrogen bonding
propensity, hydrophobicity

Model Usage
Open / Closed model
Structures can be energy minimized within the receptor
surface model (Alignment, Docking)
3D-QSAR,
Virtual Screening (Catalyst/CatShape)
de novo design
Same Molecules, Different models, Different Predictions

< CoMFA > < RSM >


Fitting to Average Ligand Fitting to Average Receptor
4D QSAR Introduction

Quasar(Quasi-atomistic receptor modeling)

Receptor site represented by a 3D envelope (steric nature),
built from many superimposed active structures

Properties assigned to each surface point (r = 0.8 angstrom):
hydrophobicity, partial charge, electrostatic potential,
hydrogen bond propensity

4th dimension: ensemble of conformations, orientations and
protonation states of ligands

ex) 3D + time by Albert Einstein

< local induced fit and H-bond flip flop >


4D QSAR Induced Fit Calculation

The number of property points is equal, but the assigned
properties are adapted for each molecule.

Induced fit is calculated by positional constraints
(0.1 kcal/(mol Å^2)).
4D QSAR Examples

Different ligand adaptation

Points on the Envelope


4D QSAR Procedure

[Figure: flowchart of the procedure. Boxes: aligned congeneric actives; primordial envelope (steric nature); averaged receptor envelope; initial family of points; evolution of the points (GA); ligand repositioning; estimation of free energy of ligand binding; analysis of model family.]

Fitness Function: LOF (RMS of Gpred vs Gexp)


5D QSAR Adding One More Dimension

4D QSAR plus

5th dimension: the adaptation of the receptor binding
pocket to the individual ligand topology
- six different induced-fit protocols for
adapting the mean envelopes
- reduces bias associated with the choice
of bioactive conformation, alignment,
and induced fit model

- resulting geometry may be more absurd -> over-induced fit
Coping with Conformational Explosion:

Until now, conformational explosion has not been an issue.
In cases where there are many possible conformations,
it becomes a serious problem.

Consider all the conformations and alignments
as far as possible
-> HypoGen

Ignore the other conformations and alignments by selecting
only one conformation and one alignment
-> Topomer CoMFA
HypoGen Conformer Generation

To consider conformationally complex molecules,
conformational sampling is necessary (up to 250 conformers).

Poling Algorithm
(using potential for conformational diversity)
If a new conformer is similar to an existing conformer,
this one is penalized by a potential.
Ligands are represented by pharmacophore points
[Figure: ligand sketch with features labeled H, D and A.]

Reduction of the number of points:
< CoMFA >: lattice points, over 1000 points
< HypoGen >: pharmacophore points, below 10 points


HypoGen Definitions

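Pharmacophore: a set of features and their 3D orientation

Features: functional groups of the same kind

Most active group: MA × UncMA - A / UncA > 0.0

Least active group: log(A) - log(MA) > 3.5

(MA: activity of the most active compound; A: activity of the compound considered; Unc: uncertainty of the respective activity value)

[Figure: pharmacophore sketches with features labeled H, D and A.]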
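
The two group definitions can be applied to a list of activities as in the sketch below. It assumes concentration-like activities (lower value means more active, e.g. IC50 in nM) and one uniform uncertainty factor of 3 for every compound; both choices, like the function name `hypogen_groups`, are assumptions made for the example.

```python
from math import log10

def hypogen_groups(activities, unc=3.0):
    """Split activities into the 'most active' and 'least active' groups."""
    ma = min(activities)   # most active compound = lowest concentration (assumption)
    most_active = [a for a in activities if ma * unc - a / unc > 0.0]
    least_active = [a for a in activities if log10(a) - log10(ma) > 3.5]
    return most_active, least_active

# Hypothetical IC50 values in nM.
most, least = hypogen_groups([1.0, 5.0, 10.0, 100.0, 5000.0])
```

With these made-up values the most active group is [1.0, 5.0] (activities below MA × Unc^2 = 9) and the least active group is [5000.0] (more than 3.5 log units weaker than the most active compound).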
HypoGen Theory

Pharmacophore Domain
-> Constructive Phase
-> Feasible Models
-> Subtractive Phase
-> Optimization Phase
-> Top Scoring Models
HypoGen Constructive Phase

1. Training Set

2. Identify the most active compounds

3. Enumerate all possible pharmacophore configurations
(for the most active compound, then the second most active, ...)

4. Check for duplicates

5. Ensure that the rest of the most active compounds fit to
MinSubsetPoint features (reduces the # of hypotheses)
HypoGen Subtractive Phase

1. Training Set

2. Identify the least active compounds

3. Enumerate all possible pharmacophore configurations
(for the least active compound, then the next least active, ...)

4. Check for configurations shared with
the most active compounds

5. Eliminate if shared by more than half
of the least actives
HypoGen Optimization Phase
(From Feasible Pharmacophores)

1. Features and/or locations are varied to optimize activity
prediction via an SA (simulated annealing) approach

2. Geometric fits are calculated

3. Linear regression of log(activity) vs Geometric Fit

4. Total cost is calculated for each new hypothesis
(the best hypothesis has the lowest cost)

Cost terms: the fitness error (against activity data) and the
weight error (against feature weights)
HypoGen pros & cons

Can handle conformationally complex molecules
(up to 250 conformations, tetrapeptide)

Its output can be used as the input geometry for methods such as CoMFA
(it is also one of the alignment methods)

Activity information is utilized for conformer/superposition selection

Provides reasonable selection criteria among many hypotheses

SAR rather than QSAR
(No electrostatic or steric information is considered.)
Topomer CoMFA alignment

Adjust Torsions: (1,3), (1,2), (5,8), (10,14)
Combi-Chem library
Direction: pointing away from the FIT atoms
A Fixed Core
Calculate steric field like CoMFA

[Figure: example structure with a fixed core and atoms numbered 1 to 15.]
Topomer CoMFA Features

Single Conformation

Automatic Alignment & Superposition

Binned Field Values reducing the # of fields

Steric Fields: 0, 2, 4, 6, 8, .., 30 kcal/mol


Electrostatic Fields: -13, -11, .., 11, 13 kcal/mol

Attenuation Factor (0.85)


Topomer CoMFA Features

Attenuation Factor (0.85)

[Figure: the fixed-core example structure (atoms numbered 1 to 15) annotated with attenuation factors 1, 0.85 and 0.85^2.]
Topomer CoMFA pros & cons

Fast
Virtual Screening Conscious
Lead Optimization
Lead Hopping (Lead Generation)

Is this really true? (high q2)


Requires common moiety (for fixed core)
More false positives
SAR rather than QSAR?
Conclusion:

3D QSAR methods:
Very robust
Lead optimization

For Virtual Screening

3D QSAR -> 3D SAR
