
The Structure

Lectures
Boris Steipe
boris.steipe@utoronto.ca http://biochemistry.utoronto.ca/steipe

Departments of Biochemistry and Molecular and Medical Genetics


Program in Proteomics and Bioinformatics
University of Toronto

9.0 1
Lecture 9.0:
Use of Protein Structure


(Some slides have been adapted from material by Chris Hogue, Toronto, prepared for CBW in 2002)

9.0 2
Concepts
1. "Sequence" and "structure" are abstractions of biopolymers.
2. Structure can be determined experimentally.
3. Structure abstractions can be stored, retrieved and visualized.
4. Knowledge of structure allows mechanistic explanations.
5. Structure is not arbitrary, but comes in units - motifs, helices,
strands, domains and complexes.
6. Domains are folding units, functional units and units of
inheritance.

9.0 3
Concept 1:

"Sequence" and
"structure" are
abstractions of
biopolymers.
9.0 4
Physical Amino Acids and
Amino Acid Abstractions
Formula: C9H9NO2
SMILES string†: [CH]([NH][R])([C](=[O])[R])[CH2]-[c]1([cH][cH][c]([cH][cH]1)[OH])
Name: Tyrosine
3-Letter: Tyr
1-Letter: Y
ATOM 1091 N TYR 145 -35.676 -13.136 50.622 1.00 10.36
ATOM 1092 CA TYR 145 -36.931 -13.763 51.019 1.00 10.63
ATOM 1093 C TYR 145 -37.676 -12.879 52.016 1.00 11.16
ATOM 1094 O TYR 145 -37.061 -12.316 52.926 1.00 13.91
ATOM 1095 CB TYR 145 -36.660 -15.140 51.638 1.00 9.52
ATOM 1096 CG TYR 145 -37.845 -15.737 52.361 1.00 6.36
ATOM 1097 CD1 TYR 145 -38.144 -15.357 53.663 1.00 3.30
ATOM 1098 CD2 TYR 145 -38.691 -16.652 51.727 1.00 6.14
ATOM 1099 CE1 TYR 145 -39.248 -15.856 54.311 1.00 5.57
ATOM 1100 CE2 TYR 145 -39.804 -17.165 52.376 1.00 4.89
ATOM 1101 CZ TYR 145 -40.076 -16.757 53.670 1.00 4.35
ATOM 1102 OH TYR 145 -41.170 -17.231 54.345 1.00 4.44


http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html

9.0 5
The Concept of Abstract Amino Acids
Allows Highly Compressed Information
Y

H-bond donor
Nucleophile
Bulky
Phospho-acceptor
Hydrophobic
H-bond acceptor
Aromatic
2° side chain rotational freedom

9.0 6
The Concept of Abstract Amino
Acid Similarity is Lossy
Y

H-bond donor (CHKNQRSTWY)
Nucleophile (CDESTY)
Bulky (FILQRYW)
Phospho-acceptor (STY)
Hydrophobic (FAMILYVW)
H-bond acceptor (DEHNQSTY)
Aromatic (FWH)
2° side chain rotational freedom (CDFHSW)
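
One way to see why this kind of similarity is lossy is to treat each property list as a set and ask which residues end up with the same property profile. The sketch below copies the sets exactly as they appear on the slide; the function names and the 20-letter alphabet string are illustrative assumptions, not part of the lecture.

```python
# Property sets exactly as listed on the slide (one-letter amino acid codes).
PROPERTY_SETS = {
    "H-bond donor": "CHKNQRSTWY",
    "Nucleophile": "CDESTY",
    "Bulky": "FILQRYW",
    "Phospho-acceptor": "STY",
    "Hydrophobic": "FAMILYVW",
    "H-bond acceptor": "DEHNQSTY",
    "Aromatic": "FWH",
    "2° side chain rotational freedom": "CDFHSW",
}


def profile(residue):
    """Return the set of slide properties whose list contains `residue`."""
    return frozenset(
        name for name, members in PROPERTY_SETS.items() if residue in members
    )


def indistinguishable_groups(alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Group residues that share an identical property profile."""
    groups = {}
    for aa in alphabet:
        groups.setdefault(profile(aa), []).append(aa)
    return [members for members in groups.values() if len(members) > 1]


if __name__ == "__main__":
    # Residues in the same group cannot be told apart by these properties.
    print(indistinguishable_groups())
```

With these particular sets, residues that differ chemically can collapse onto a single profile, which is the information loss the slide title refers to.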

9.0 7
Structure Contextualizes Sequence

… V V I Y T T G … (Tyr262 in 1ERQ.pdb)

9.0 8
Structural Abstraction

To store structures we need:
- coordinate,
- topology, and
- chemical type
information.

[Figure: methionine (Met) shown in an x, y, z coordinate frame, with atoms labelled by element: sulphur, carbon, oxygen, nitrogen]
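
A minimal sketch of a data structure that holds the three kinds of information named above. The class and field names are assumptions made for illustration, and the coordinates are placeholders rather than measured values.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Atom:
    name: str      # atom label, e.g. "SD"
    element: str   # chemical type, e.g. "S"
    xyz: tuple     # coordinate (x, y, z) in Å


@dataclass
class Structure:
    atoms: list = field(default_factory=list)
    bonds: set = field(default_factory=set)  # topology: pairs of atom indices

    def add_atom(self, atom):
        """Store an atom and return its index."""
        self.atoms.append(atom)
        return len(self.atoms) - 1

    def add_bond(self, i, j):
        """Record an undirected bond between atoms i and j."""
        self.bonds.add(frozenset((i, j)))

    def neighbours(self, i):
        """Indices of all atoms bonded to atom i."""
        return sorted(j for bond in self.bonds if i in bond for j in bond if j != i)


# Fragment of a methionine side chain; coordinates are placeholders only.
met = Structure()
cg = met.add_atom(Atom("CG", "C", (0.0, 0.0, 0.0)))
sd = met.add_atom(Atom("SD", "S", (1.8, 0.0, 0.0)))
ce = met.add_atom(Atom("CE", "C", (2.4, 1.7, 0.0)))
met.add_bond(cg, sd)
met.add_bond(sd, ce)
```

Keeping topology separate from coordinates means bonds do not have to be guessed from distances each time the structure is read.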

9.0 9
Concept 2:

Structure can
be determined
experimentally.
9.0 10
Experimental sources of
structure
X-ray:
• Crystallization required
• Diffraction → data collection
• The phase problem: MAD, heavy metal isomorphous derivatives ...
• ... or "Molecular replacement" give phase approximations
• Model building in electron density maps
• Refinement

9.0 11
Experimental sources of
structure
X-ray:
Crystallization is limiting.
Diffraction is not imaging!
Refinement is required.

[Figure: diffraction data and the model derived from it]

http://www-structure.llnl.gov/Xray/101index.html

9.0 12
Experimental sources of
structure
NMR:
• High concentration required (~ 1 mM)
• Assignment of peaks ...
• ... determination of crosspeaks → distance constraints
• Calculation of models from distance constraints
• Refinement

9.0 13
Experimental sources of
structure

NMR:
1DRO.PDB
Ensemble of structures that are compatible with experimental distance constraints
Consensus model

Concentration/Solubility
Assignment and NOEs
Refinement

9.0 14
Assessing structure quality
Metrics:
• Resolution, R-factor and R-free
• Bond length and angle deviations
• Coordinate error can be estimated from diffraction data

http://www.sci.sdsu.edu/TFrey/Bio750/Bio750X-Ray.html

Programs Whatcheck and Procheck calculate quality metrics:

http://swift.cmbi.kun.nl/WIWWWI//fullcheck.html
http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html (also NMR)

Rules of thumb for "good structures":


Resolution 2 Å, R-factor 20%, mean coordinate error 0.2 Å, RMSD bond lengths 0.02 Å
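
For orientation, the conventional R-factor compares observed and calculated structure factor amplitudes: R = Σ| |Fobs| − |Fcalc| | / Σ|Fobs|. The sketch below assumes the two lists are already on a common scale (real refinement programs also fit a scale factor); the function name and the amplitudes in the usage note are made up for illustration. R-free is the same expression evaluated on a test set of reflections that was excluded from refinement.

```python
def r_factor(f_obs, f_calc):
    """Conventional R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|).

    Assumes both amplitude lists are on the same scale.
    """
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator
```

For example, `r_factor([100.0, 50.0, 25.0], [90.0, 55.0, 25.0])` evaluates (10 + 5 + 0) / 175, below the 20% rule of thumb.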

9.0 15
Concept 3:

Structure
abstractions can
be stored,
retrieved and
visualized.
9.0 16
The PDB

The PDB is the primary repository of protein structure data.

http://www.rcsb.org/pdb

9.0 17
What’s in a Structure File?
• Population experiments
• X-ray, 1 structure
• NMR - sometimes many structures
• Incomplete - not all “atoms” are there
• Hydrogens, parts of the protein in motion
• Crystallographic “space”
• correct, but not always relevant

9.0 18
The PDB format

•Flat file, column oriented


•Human readable
•Human editable
•Huge legacy problems

Flat file: A data file without indexing structure or hierarchy, in contrast to a relational database or a data grammar.

9.0 19
Header
HEADER IMMUNOGLOBULIN 01-MAR-93 2IMM 2IMM 2
COMPND IMMUNOGLOBULIN VL DOMAIN (VARIABLE DOMAIN OF KAPPA LIGHT 2IMM 3
COMPND 2 CHAIN) OF MCPC603 2IMM 4
SOURCE HUMAN (HOMO $SAPIENS) RECOMBINANT SYNTHETIC M603 GENE 2IMM 5
AUTHOR B.STEIPE,R.HUBER 2IMM 6
REVDAT 1 15-JUL-93 2IMM 0 2IMM 7
REMARK 1 2IMM 8
REMARK 1 REFERENCE 1 2IMM 9
REMARK 1 AUTH B.STEIPE,A.PLUCKTHUN,R.HUBER 2IMM 10
REMARK 1 TITL REFINED CRYSTAL STRUCTURE OF A RECOMBINANT 2IMM 11
REMARK 1 TITL 2 IMMUNOGLOBULIN DOMAIN AND A 2IMM 12
REMARK 1 TITL 3 COMPLEMENTARITY-DETERMINING REGION 1-GRAFTED MUTANT 2IMM 13
REMARK 1 REF J.MOL.BIOL. V. 225 739 1992 2IMM 14
REMARK 1 REFN ASTM JMOBAK UK ISSN 0022-2836 070 2IMM 15

[...]

REMARK 2 2IMM 23
REMARK 2 RESOLUTION. 2.00 ANGSTROMS. 2IMM 24
REMARK 3 2IMM 25

[...]

9.0 20
Seqres

[...]
SEQRES 1 114 ASP ILE VAL MET THR GLN SER PRO SER SER LEU SER VAL 2IMM 35
SEQRES 2 114 SER ALA GLY GLU ARG VAL THR MET SER CYS LYS SER SER 2IMM 36
SEQRES 3 114 GLN SER LEU LEU ASN SER GLY ASN GLN LYS ASN PHE LEU 2IMM 37
SEQRES 4 114 ALA TRP TYR GLN GLN LYS PRO GLY GLN PRO PRO LYS LEU 2IMM 38
SEQRES 5 114 LEU ILE TYR GLY ALA SER THR ARG GLU SER GLY VAL PRO 2IMM 39
SEQRES 6 114 ASP ARG PHE THR GLY SER GLY SER GLY THR ASP PHE THR 2IMM 40
SEQRES 7 114 LEU THR ILE SER SER VAL GLN ALA GLU ASP LEU ALA VAL 2IMM 41
SEQRES 8 114 TYR TYR CYS GLN ASN ASP HIS SER TYR PRO LEU THR PHE 2IMM 42
SEQRES 9 114 GLY ALA GLY THR LYS LEU GLU LEU LYS ARG 2IMM 43
[...]

The explicit sequence (SEQRES records, above) and the implicit sequence (the residues actually present in the ATOM records) may differ !
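
To compare the two, the explicit sequence first has to be read out of the SEQRES records. The sketch below is a whitespace-tolerant reading suited to the collapsed text shown on this slide, not a column-strict PDB parser; the function name is an assumption. It keeps only tokens that are standard three-letter residue names, so serial numbers and legacy trailer fields such as "2IMM 35" are ignored.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_one_letter(lines):
    """Concatenate the residues of all SEQRES lines as one-letter codes."""
    sequence = []
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] != "SEQRES":
            continue
        # Skip the record name; keep only recognised residue names.
        sequence.extend(THREE_TO_ONE[t] for t in tokens[1:] if t in THREE_TO_ONE)
    return "".join(sequence)


example = [
    "SEQRES 1 114 ASP ILE VAL MET THR GLN SER PRO SER SER LEU SER VAL 2IMM 35",
    "SEQRES 2 114 SER ALA GLY GLU ARG VAL THR MET SER CYS LYS SER SER 2IMM 36",
]
print(seqres_to_one_letter(example))
```

The same one-letter string built from the residues seen in the ATOM records can then be compared with this one to find missing or disordered residues.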

9.0 21
Pitfalls:
Atom name: The atom name is a mix of chemical element and bond topology. "CA.." (calcium) ≠ ".CA." (alpha carbon)
Sequence number: The sequence number is actually a string. Chain and insertion code are required to make it unique (e.g. B 123A).

Fields, left to right: record type, atom number, atom name, amino acid type, sequence number, X, Y, Z, Occ (occupancy), B (temperature factor)

ATOM 119 CA ARG 18 8.386 51.105 35.847 1.00 7.30 2IMM 179

PDB format is strictly column oriented !
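
Because the format is column oriented, a record should be sliced by position rather than split on whitespace. The sketch below uses the standard ATOM column layout (serial 7-11, atom name 13-16, residue name 18-20, chain 22, sequence number 23-26, insertion code 27, x/y/z 31-54, occupancy 55-60, B 61-66). The helper names are assumptions; `format_atom_record` exists only to rebuild a correctly spaced line, since the spacing of the example on the slide has been collapsed.

```python
def format_atom_record(serial, name, res_name, chain, res_seq, i_code,
                       x, y, z, occupancy, b_factor):
    """Build a correctly spaced ATOM line (name must be the padded 4-char field)."""
    return (
        "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
        "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
    ).format("ATOM", serial, name, "", res_name, chain, res_seq, i_code,
             x, y, z, occupancy, b_factor)


def parse_atom_record(line):
    """Slice one ATOM/HETATM line by fixed column positions."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16],             # padding kept: " CA " is not "CA  "
        "res_name": line[17:20].strip(),
        "chain": line[21].strip(),
        "res_seq": line[22:26].strip(),  # kept as a string on purpose
        "i_code": line[26].strip(),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


def residue_id(record):
    """Chain, sequence number and insertion code together identify a residue."""
    return (record["chain"], record["res_seq"], record["i_code"])
```

Keeping the atom name unstripped preserves the difference between an alpha carbon and a calcium ion, and `residue_id` reflects the point that the sequence number alone is not unique.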

9.0 22
Hetero Atoms
[...]
HETATM 877 O HOH 1 -4.169 60.050 40.145 1.00 3.00 2IMM 937
[...]

http://xray.bmc.uu.se/hicup/

9.0 23
The crystallographic asymmetric unit does not
necessarily contain a functional molecule
The contents of a crystal lattice unit cell can be generated from the asymmetric unit by applying the required symmetry operations for the crystallographic space group. But this is not trivial for the non-crystallographer, nor is it obvious which of the symmetry replicates might make physiological contacts.

1qpi.pdb Tet-repressor/operator complex

9.0 24
... Biological Unit
PQS reasons automatically about how a monomer might be correctly completed to a functional biomolecular complex (and is often correct).

http://pqs.ebi.ac.uk/

9.0 25
NCBI
structure
group

MMDB - very
well integrated
but somewhat
impenetrable.

9.0 26
NDB

http://ndbserver.rutgers.edu/NDB/

urx035.pdb
(Hammerhead Ribozyme)

9.0 27
PDBsum - and "secondary"
structure databases


http://www.biochem.ucl.ac.uk/bsm/pdbsum/

9.0 28
PDBsum - Information


9.0 29
Others
Macromolecular Structure Database at EBI (Relibase, PQS ...)
http://www.ebi.ac.uk/msd/

Macromolecular structure related resources at the PDB


http://www.rcsb.org/pdb/links.html

Structure links at the Southwestern Biotechnology and Informatics Center


http://www.swbic.org/links/1.19.2.5.php

Molecular Models from Chemistry


http://people.ouc.bc.ca/woodcock/molecule/molecule.html

Molecular Library
http://www.nyu.edu/pages/mathmol/library/

.... many, many more.

9.0 30
Concept 4:

Knowledge of
structure allows
mechanistic
explanations.
9.0 31
Structure as an integrated map
- Example questions
• Which part of my structure appears to be conserved ?
• Are two functionally important residues possibly in contact ?
• Where is Asn220 relative to the active site ?
• May the mutation E123A possibly have something to do with
protein stability ?
• Is Leu234 on the surface, or in the core ?
• I want to clone my protein into a yeast two-hybrid system: should I
fuse the DNA binding domain to the N- or the C- terminus ?

9.0 32
Geometric relationships
• Bonds
• Angles, planar and dihedral
• Surfaces
• Chemical potential, amino acid functions
• Static and dynamic disorder
• Structural similarity
• Electrostatics
• Conservation patterns (structural and functional)
• Quaternary structure
• Posttranslational modification sites
• Unexpected homology
• [...]

9.0 33
Distances from
coordinates
XYZ coordinates are vectors in an orthogonal coordinate system, in Å.
All the rules of analytical geometry apply.

[...]
ATOM 687 OH TYR 86 7.415 62.584 32.900 1.00 3.37
[...]
ATOM 651 O ASP 82 9.996 62.571 32.488 1.00 5.18
[...]

d = \sqrt{(9.996-7.415)^2 + (62.571-62.584)^2 + (32.488-32.900)^2}
  = \sqrt{(2.581)^2 + (-0.013)^2 + (-0.412)^2}
  = \sqrt{6.661561 + 0.000169 + 0.169744}
  = \sqrt{6.831474}
  = 2.614 Å = 0.2614 nm = 2.614 \times 10^{-10} m
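
The same calculation as a short sketch, using the two coordinate triples from the ATOM records on this slide. The function name is an assumption.

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y, z) points, in the input units (Å)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


tyr86_oh = (7.415, 62.584, 32.900)
asp82_o = (9.996, 62.571, 32.488)
print(round(distance(tyr86_oh, asp82_o), 3))
```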

9.0 34
Dihedral angles

Single bonds:
Freely rotatable, but constrained by steric overlap. Small energetic barrier, preference for staggered conformations.
Double bonds:
Constrained to planar geometry. Large energetic barrier to isomerization.

[Figure: a dihedral angle defined by four consecutive atoms i, i+1, i+2, i+3]
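
A dihedral angle can be computed directly from the coordinates of four consecutive atoms. The sketch below uses one common vector formulation (project the two outer bonds onto the plane perpendicular to the central bond, then take the signed angle between the projections); it is an illustration with assumed function names, not code from the lecture.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for atoms i, i+1, i+2, i+3."""
    b0 = _sub(p0, p1)            # from atom i+1 back to atom i
    b1 = _sub(p2, p1)            # central bond, i+1 -> i+2
    b2 = _sub(p3, p2)            # from atom i+2 on to atom i+3
    length = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / length for x in b1)
    # Remove the components along the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))
```

Applied to the backbone atoms C(i-1), N, CA, C and N, CA, C, N(i+1), the same function would give the backbone dihedrals used in a Ramachandran plot.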

9.0 35
Backbone dihedral angles:
Ramachandran plots


Rotatable bonds in the backbone are named φ, ψ and ω.
Due to steric overlap, not all combinations of (φ, ψ) are allowed.
Allowed and forbidden regions of (φ, ψ) space are shown on the Ramachandran plot.
Observed (φ, ψ) values reflect the theoretical boundaries well.

9.0 36
Sidechain rotamers

3
2

100 randomly chosen
Phe-residues superimposed.

Ponder & Richards (1987) J. Mol. Biol. 193, 775-791


http://dunbrack.fccc.edu/bbdep/

9.0 37
H-bond patterns
Example: TYR - Side Chain Donor
OH can donate a single hydrogen
(The OH-H bond is 1.00Å long and lies in the plane
of CE1, CE2, CZ and OH forming an angle of 110
degrees with the CZ-OH bond.)

Distribution of H-bond counts in all and buried residues, D-A distances, H-A distances and D-H-A angles in Tyr sidechains.

Tyr-Thr sidechain H-bond: despite canonical geometry, correct topology may be ambiguous!

McDonald & Thornton (1994) J. Mol. Biol. 238, 777-793


http://www.biochem.ucl.ac.uk/bsm/atlas/

9.0 38
Molecular surface

Chain "A" of
1AON.PDB -
GroEL/ES complex

Surface rendering
of GroEL/ES
complex
(D. Goodsell)

9.0 39
Molecular surface
Surface provides a visual metaphor,
and a useful tool to map properties.

But how can a molecular surface be


defined ? Obviously, the hard-sphere
surface is chemically not very relevant.

Van der Waals surface

9.0 40
Molecular surface

Probe: r = 1.4 Å

Van der Waals surface

9.0 41
Molecular surface
Contact surface

Accessible surface

"Accessible"

Van der Waals surface


"Buried" Reentrant surface

9.0 42
Calculating solvent accessible
surfaces
1. Draw a sphere around each atom, with a radius of (VdW radius + solvent probe radius).
2. Erase every part of a sphere surface that lies inside another sphere.
3. The remaining area is the accessible surface.

Probe: r = 1.4 Å
C: 1.75 Å
N: 1.55 Å
O: 1.4 Å
H: 1.17 Å
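
The three steps can be sketched numerically by covering each expanded sphere with test points and counting those not inside any other expanded sphere. The radii are the ones listed on this slide. The point set is a golden-spiral lattice, which is a substitution made here for simplicity; the function names and the default of 200 points per atom are also assumptions.

```python
import math

PROBE_RADIUS = 1.4  # Å, as on the slide
VDW_RADII = {"C": 1.75, "N": 1.55, "O": 1.4, "H": 1.17}  # Å, as on the slide


def sphere_points(n):
    """Roughly even points on a unit sphere (golden-spiral lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        phi = i * golden_angle
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points


def accessible_areas(atoms, n_points=200):
    """atoms: list of (element, (x, y, z)); returns one area in Å^2 per atom."""
    unit = sphere_points(n_points)
    radii = [VDW_RADII[element] + PROBE_RADIUS for element, _ in atoms]
    areas = []
    for i, (_, centre) in enumerate(atoms):
        exposed = 0
        for ux, uy, uz in unit:
            # Step 1: a point on the expanded sphere of atom i.
            p = (centre[0] + radii[i] * ux,
                 centre[1] + radii[i] * uy,
                 centre[2] + radii[i] * uz)
            # Step 2: discard it if it lies inside any other expanded sphere.
            buried = any(
                j != i and math.dist(p, other) < radii[j]
                for j, (_, other) in enumerate(atoms)
            )
            if not buried:
                exposed += 1
        # Step 3: the surviving fraction of the sphere area is accessible.
        areas.append(4.0 * math.pi * radii[i] ** 2 * exposed / n_points)
    return areas
```

An isolated atom returns its full expanded sphere area; the result for real structures depends on the number of points and on the radii chosen.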

9.0 43
Parameters and assumptions
Problem: Analytical solution inefficient.
Solution: Numerical solution with probe points
Problem: Regular placement of n probe points
Solution: Stochastic placement
Problem: Stochastic placement quite irregular
Solution: Enforce minimum separation
Problem: Efficiency
Solution: Place points only once, translate as needed
Problem: What is a good value for n ?
Solution: Try different n, evaluate standard deviation
Problem: Should n be constant per atom, or per area ?
Solution: dots/area - need to scale dots with r_VdW
Problem: Hydrogens - where to get united atom radii ?
Solution: Literature search.
Problem: Reference areas for relative SAA needed
Solution: Model explicitly, as tripeptides
[...]

Stochastic placement of points on a sphere:
u, v \in [0,1]
\theta = 2\pi u
\phi = \cos^{-1}(2v - 1)
http://mathworld.wolfram.com/SpherePointPicking.html

Even a straightforward algorithm has its hidden parameters and assumptions.


Results are meaningful only in this context. Any comparison is problematic.
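
As an illustration of one such hidden choice, stochastic point placement on a sphere can be sketched with the usual sphere point picking recipe: draw u and v uniformly from [0,1], set θ = 2πu and φ = cos⁻¹(2v − 1). Taking φ directly proportional to v would crowd points at the poles. The function name and the seeded generator are assumptions.

```python
import math
import random


def random_unit_vector(rng):
    """One point drawn uniformly from the surface of the unit sphere."""
    u, v = rng.random(), rng.random()
    theta = 2.0 * math.pi * u          # azimuth
    phi = math.acos(2.0 * v - 1.0)     # polar angle; cos(phi) is uniform
    return (math.sin(phi) * math.cos(theta),
            math.sin(phi) * math.sin(theta),
            math.cos(phi))


rng = random.Random(0)  # fixed seed so that runs are repeatable
points = [random_unit_vector(rng) for _ in range(10000)]
```

Enforcing a minimum separation between points, as listed above, would be an additional rejection step on top of this.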

9.0 44
Mapping properties on
surfaces
•Properties of atoms (B-factors)
•Ensemble properties of residues
(hydrophobicity, conservation)
•Geometry (local curvature)
•Fields and potentials
(isosurfaces, binding potential)

AChE (1ACL.PDB) color coded by electrostatic potential with GRASP.
(http://trantor.bioc.columbia.edu/grasp/)

9.0 45
Concept 5:

Structure is not
arbitrary, but
contains
recurring units.
9.0 46
Basic building blocks of
structure:
E.g. PROMOTIF - as used in PDBsum

But: classical descriptions of structural building blocks are as much


based on idealized concepts of geometry as on observations of
nature. An unbiased analysis may arrive at significantly different
classifications !

9.0 47
Unbiased structure motifs:
alignment with added value
Motif alignments ... Why are particular
amino acids conserved? What is
essential in a sequence ?

A structure motif consensus sequence, compiled


from unrelated segments, averages out features of
conservation that are only due to incomplete
divergence (homology).
A consensus sequence, taken from different
structural contexts, averages out features of
sequence that are due to specific functional
(binding, catalysis) or non-local structural
requirements (packing, interaction).
What remains is information about sequence
propensities of local structural elements.

9.0 48
A schematikon motif example:
complex loop

Motif: 1icf 215
Length: 7
Support: 7
Unique: 7
Rank: 399

[Figure: per-position plot for positions 1 to 7; vertical axis ticks from 0.0 to 3.8]

9.0 49
A schematikon motif example:
strand N-cap

Motif: 1whi 35
Length: 4
Support: 7
Unique: 7
Rank: 444

[Figure: per-position plot for positions 1 to 4; vertical axis ticks from 0.0 to 3.8]

9.0 50
Concept 6:

Domains are
folding units,
functional units, and
units of inheritance.
9.0 51
Domains are ubiquitous in
proteins
Large proteins are composed of compact,
semi-independent units - domains.

Reason:
Modularity
Folding efficiency

2MCP.PDB

9.0 52
Domains in proteins:
Number of
domains in 787
representative
proteins used as
the basis for the
CATH database

Jones S et al. (1998)


Protein Science 7:233

9.0 53
Domains in proteins:
Non-random
relationship
between domain
number and
chain length in
the 787
representative
proteins used as
the basis for the
CATH database

Jones S et al. (1998)


Protein Science 7:233

9.0 54
Domains in proteins:
Domain size in
the 787
representative
proteins used as
the basis for the
CATH database

Jones S et al. (1998)


Protein Science 7:233

9.0 55
There is no universal
definition of "domains"
Possible definitions are based on independently inherited (sub)sequences
(sequence domain), modular protein functions (functional domain), folding
unit or atomic contacts (structural domain).

Domain: A part of structure that can fold


irrespective of the presence of other
parts of structure

But: what is measured is commonly sequence, function, or structure - NOT FOLDING!

9.0 56
Further complications:
Analogous
structure,

Domain
insertions,

Circular
permutations,

Domain
swapping.

Domain insertion
1A2J.PDB 2TRX.PDB
Protein disulfide isomerase Thioredoxin

9.0 57
Further complications:
Analogous
structure,

Domain
insertions,

Circular
permutations,

Domain 253
swapping.
Circular permutation
1ERQ.PDB
1ALQ.PDB
beta lactamase
beta lactamase

9.0 58
Further complications:
Analogous
structure,

Domain
insertions,

Circular
permutations,

Domain
swapping.

Domain swapping
11BG.PDB
Bull seminal ribonuclease

9.0 59
Domains can be elusive:

The separation of a structure into domains requires the arbitrary (informed) definition of thresholds in a continuum of possibilities.

9.0 60
Why care ?
Function:
Evolution works on sequence, but selects for function.
Definition of domains in structure can uncover functional units
that may evolve independently. Sequence searches, alignments
etc. with domains are much more specific.

Once structural domains have been defined, sequence profiles,


HMMs or other computational procedures can be used to pick
out more members of the domain family from the database.

Domains can be defined from sequence patterns, or from the analysis of structure.

9.0 61
Automated (objective) domain
definition: - Sequence (CDD)
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

CDD
from Smart
and Pfam

CDART
from CDD
and Genbank

9.0 62
Semi-automated consensus domain
definition: - Structure (CATH)

Dihydrolipoamide
dehydrogenase 1LPFA:

Jones S et al. (1998) Domain assignment for protein structures using a consensus
approach: Characterization and analysis. Protein Science 7:233-242

9.0 63
SCOP & CATH: structural classification

The eight
most
frequent
SCOP
Superfolds

http://scop.mrc-lmb.cam.ac.uk/scop/
http://www.biochem.ucl.ac.uk/bsm/cath/

9.0 64
CATH - Class

Class 1: Mainly Alpha
Class 2: Mainly Beta
Class 3: Mixed Alpha/Beta
Class 4: Few Secondary Structures

9.0 65
CATH - Architecture

Roll, Super Roll, Barrel, 2-Layer Sandwich

9.0 66
CATH - Topology

L-fucose Isomerase, Serine Protease, Aconitase (domain 4), TIM Barrel

9.0 67
CATH - Homology

Alanine racemase, Dihydropteroate (DHP) synthetase, FMN dependent fluorescent proteins, 7-stranded glycosidases

9.0 68
CATH -
Entry

(Example)

9.0 69
IV: Open Issues

I: Integration into processes, scriptable APIs

II: Sequence based identification of domains

III: Analysing domains in context

IV: Defining modular domain functions

9.0 70
Bioinformaticians apparently
do not like structure !
Sequence:
• Discrete alphabet
• Easy to manipulate
• Well developed datastructures
• Well developed libraries

Structure:
• Continuous space
• Linear algebra, complicated energy functions
• Databases and datastructures are difficult
• Paucity of libraries

Meet the challenge !


9.0 71
