The Structure Lectures: Boris Steipe

The Structure
Lectures
Boris Steipe
boris.steipe@utoronto.ca http://biochemistry.utoronto.ca/steipe
Departments of Biochemistry and Molecular and Medical Genetics

Program in Proteomics and Bioinformatics
University of Toronto
9.0 1
Lecture 9.0:
Use of Protein Structure
Boris Steipe
boris.steipe@utoronto.ca http://biochemistry.utoronto.ca/steipe
Departments of Biochemistry and Molecular and Medical Genetics

Program in Proteomics and Bioinformatics
University of Toronto
( Some slides have been adapted from material by Chris Hogue, Toronto, prepared for CBW in 2002)
9.0 2
Concepts
1. "Sequence" and "structure" are abstractions of biopolymers.
2. Structure can be determined experimentally.
3. Structure abstractions can be stored, retrieved and visualized.
4. Knowledge of structure allows mechanistic explanations.
5. Structure is not arbitrary, but comes in units - motifs, helices,
strands, domains and complexes.
6. Domains are folding units, functional units and units of
inheritance.
9.0 3
Concept 1:
"Sequence" and
"structure" are
abstractions of
biopolymers.
9.0 4
Physical Amino Acids and
Amino Acid Abstractions
Formula: C9H9NO2
SMILES string†: [CH]([NH][R])([C](=[O])[R])[CH2]-[c]1([cH][cH][c]([cH][cH]1)[OH])
Name: Tyrosine
3-Letter: Tyr
1-Letter: Y
ATOM 1091 N TYR 145 -35.676 -13.136 50.622 1.00 10.36
ATOM 1092 CA TYR 145 -36.931 -13.763 51.019 1.00 10.63
ATOM 1093 C TYR 145 -37.676 -12.879 52.016 1.00 11.16
ATOM 1094 O TYR 145 -37.061 -12.316 52.926 1.00 13.91
ATOM 1095 CB TYR 145 -36.660 -15.140 51.638 1.00 9.52
ATOM 1096 CG TYR 145 -37.845 -15.737 52.361 1.00 6.36
ATOM 1097 CD1 TYR 145 -38.144 -15.357 53.663 1.00 3.30
ATOM 1098 CD2 TYR 145 -38.691 -16.652 51.727 1.00 6.14
ATOM 1099 CE1 TYR 145 -39.248 -15.856 54.311 1.00 5.57
ATOM 1100 CE2 TYR 145 -39.804 -17.165 52.376 1.00 4.89
ATOM 1101 CZ TYR 145 -40.076 -16.757 53.670 1.00 4.35
ATOM 1102 OH TYR 145 -41.170 -17.231 54.345 1.00 4.44
† http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
9.0 5
The Concept of Abstract Amino Acids
Allows Highly Compressed Information
Y (Tyrosine):
H-bond Donor
Nucleophile
Bulky
Phospho-Acceptor
Hydrophobic
H-Bond Acceptor
Aromatic
2° side chain rotational freedom
9.0 6
The Concept of Abstract Amino
Acid Similarity is Lossy
Y (Tyrosine) shares each property with a different set of amino acids:
H-bond Donor (CHKNQRSTWY)
Nucleophile (CDESTY)
Bulky (FILQRYW)
Phospho-Acceptor (STY)
Hydrophobic (FAMILYVW)
H-Bond Acceptor (DEHNQSTY)
Aromatic (FWH)
2° side chain rotational freedom (CDFHSW)
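The lossy nature of similarity can be made concrete with a small sketch. It transcribes the property sets from this slide into Python sets and asks which of tyrosine's properties another residue shares. The dictionary and function names are illustrative assumptions, not part of the lecture material.

```python
# Property sets transcribed from the slide: for each property of Tyr,
# the one-letter codes of the residues listed as sharing it.
TYR_PROPERTY_SETS = {
    "H-bond donor": set("CHKNQRSTWY"),
    "Nucleophile": set("CDESTY"),
    "Bulky": set("FILQRYW"),
    "Phospho-acceptor": set("STY"),
    "Hydrophobic": set("FAMILYVW"),
    "H-bond acceptor": set("DEHNQSTY"),
    "Aromatic": set("FWH"),
    "Side chain rotational freedom": set("CDFHSW"),
}


def shared_with_tyr(residue):
    """Return the names of the Tyr properties that `residue` shares."""
    return {name for name, members in TYR_PROPERTY_SETS.items()
            if residue in members}


phe = shared_with_tyr("F")
ser = shared_with_tyr("S")
print("Phe shares:", sorted(phe))
print("Ser shares:", sorted(ser))
print("Both share:", sorted(phe & ser))
```

Phe and Ser are each "similar" to Tyr, but for almost entirely different reasons, which a single similarity score cannot express.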
9.0 7
Structure Contextualizes Sequence
… V V I Y T T G … (Tyr262 in 1ERQ.pdb)
9.0 8
Structural Abstraction
To store structures we need:
- coordinate,
- topology, and
- chemical type
information.
(Figure: methionine (Met) in an x, y, z coordinate system, with sulphur, carbon, oxygen and nitrogen atoms labelled.)
9.0 9
Concept 2:
Structure can
be determined
experimentally.
9.0 10
Experimental sources of structure: X-ray
• Crystallization required
• Diffraction → data collection
• The phase problem: MAD, heavy metal isomorphous derivatives ...
• ... or "molecular replacement" give phase approximations
• Model building in electron density maps
• Refinement
9.0 11
Experimental sources of structure: X-ray
Crystallization is limiting.
Diffraction is not imaging!
Refinement is required.
(Figure: data → model)
http://www-structure.llnl.gov/Xray/101index.html
9.0 12
Experimental sources of structure: NMR
• High concentration required (~ 1 mM)
• Assignment of peaks ...
• ... determination of crosspeaks → distance constraints
• Calculation of models from distance constraints
• Refinement
9.0 13
Experimental sources of structure: NMR
1DRO.PDB
Ensemble of structures that are compatible with experimental distance constraints
Consensus model
Concentration/Solubility
Assignment and NOEs
Refinement
9.0 14
Assessing structure quality
Metrics:
• Resolution, R-factor and R-free
• Bond length and angle deviations
• Coordinate error can be estimated from diffraction data
http://www.sci.sdsu.edu/TFrey/Bio750/Bio750X-Ray.html
Programs Whatcheck and Procheck calculate quality metrics:
http://swift.cmbi.kun.nl/WIWWWI//fullcheck.html
http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html (also NMR)
Rules of thumb for "good structures":
Resolution 2 Å, R-factor 20%, mean coordinate error 0.2 Å, RMSD bond lengths 0.02 Å
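As a rough sketch, the rules of thumb can be turned into a simple check. The slides do not define the R-factor; the function below uses the conventional definition, R = Σ| |Fobs| - |Fcalc| | / Σ|Fobs|. Reading the rules of thumb as upper limits is an assumption, and the function names and the structure-factor amplitudes are hypothetical.

```python
def r_factor(f_obs, f_calc):
    """Conventional crystallographic R-factor from paired amplitudes."""
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


# Rules of thumb from the slide, read here as upper limits (an assumption).
RULES_OF_THUMB = {
    "resolution": 2.0,    # Å
    "r_factor": 0.20,     # 20 %
    "coord_error": 0.2,   # Å, mean coordinate error
    "rmsd_bonds": 0.02,   # Å, RMSD of bond lengths
}


def failed_checks(metrics):
    """Return the names of all metrics that exceed their rule-of-thumb limit."""
    return [name for name, limit in RULES_OF_THUMB.items()
            if metrics[name] > limit]


# Hypothetical amplitudes, for illustration only.
r = r_factor([100.0, 80.0, 60.0, 40.0], [90.0, 85.0, 66.0, 37.0])
problems = failed_checks({"resolution": 2.5, "r_factor": r,
                          "coord_error": 0.2, "rmsd_bonds": 0.015})
print(f"R = {r:.3f}, failed: {problems}")
```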
9.0 15
Concept 3:
Structure
abstractions can
be stored,
retrieved and
visualized.
9.0 16
The PDB
The PDB is the primary repository of protein structure data.
http://www.rcsb.org/pdb
9.0 17
What’s in a Structure File?
• Population experiments
• X-ray, 1 structure
• NMR - sometimes many structures
• Incomplete - not all “atoms” are there
• Hydrogens, parts of the protein in motion
• Crystallographic “space”
• correct, but not always relevant
9.0 18
The PDB format
•Flat file, column oriented

•Human readable
•Human editable
•Huge legacy problems
Flat File: A datafile without indexing structure or hierarchy, in contrast to a relational database or a data grammar.
9.0 19
Header
HEADER IMMUNOGLOBULIN 01-MAR-93 2IMM 2IMM 2
COMPND IMMUNOGLOBULIN VL DOMAIN (VARIABLE DOMAIN OF KAPPA LIGHT 2IMM 3
COMPND 2 CHAIN) OF MCPC603 2IMM 4
SOURCE HUMAN (HOMO $SAPIENS) RECOMBINANT SYNTHETIC M603 GENE 2IMM 5
AUTHOR B.STEIPE,R.HUBER 2IMM 6
REVDAT 1 15-JUL-93 2IMM 0 2IMM 7
REMARK 1 2IMM 8
REMARK 1 REFERENCE 1 2IMM 9
REMARK 1 AUTH B.STEIPE,A.PLUCKTHUN,R.HUBER 2IMM 10
REMARK 1 TITL REFINED CRYSTAL STRUCTURE OF A RECOMBINANT 2IMM 11
REMARK 1 TITL 2 IMMUNOGLOBULIN DOMAIN AND A 2IMM 12
REMARK 1 TITL 3 COMPLEMENTARITY-DETERMINING REGION 1-GRAFTED MUTANT 2IMM 13
REMARK 1 REF J.MOL.BIOL. V. 225 739 1992 2IMM 14
REMARK 1 REFN ASTM JMOBAK UK ISSN 0022-2836 070 2IMM 15
[...]
REMARK 2 2IMM 23
REMARK 2 RESOLUTION. 2.00 ANGSTROMS. 2IMM 24
REMARK 3 2IMM 25
[...]
9.0 20
Seqres
[...]
SEQRES 1 114 ASP ILE VAL MET THR GLN SER PRO SER SER LEU SER VAL 2IMM 35
SEQRES 2 114 SER ALA GLY GLU ARG VAL THR MET SER CYS LYS SER SER 2IMM 36
SEQRES 3 114 GLN SER LEU LEU ASN SER GLY ASN GLN LYS ASN PHE LEU 2IMM 37
SEQRES 4 114 ALA TRP TYR GLN GLN LYS PRO GLY GLN PRO PRO LYS LEU 2IMM 38
SEQRES 5 114 LEU ILE TYR GLY ALA SER THR ARG GLU SER GLY VAL PRO 2IMM 39
SEQRES 6 114 ASP ARG PHE THR GLY SER GLY SER GLY THR ASP PHE THR 2IMM 40
SEQRES 7 114 LEU THR ILE SER SER VAL GLN ALA GLU ASP LEU ALA VAL 2IMM 41
SEQRES 8 114 TYR TYR CYS GLN ASN ASP HIS SER TYR PRO LEU THR PHE 2IMM 42
SEQRES 9 114 GLY ALA GLY THR LYS LEU GLU LEU LYS ARG 2IMM 43
[...]
Explicit (above) and implicit sequence may differ !
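A minimal sketch of extracting the explicit sequence from SEQRES records follows. It splits on whitespace because the records are shown here without their original column spacing; a real parser should slice by column. The function name is an assumption, and non-standard residue names are silently skipped in this sketch.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_one_letter(lines):
    """Concatenate the residues of all SEQRES records as one-letter codes."""
    letters = []
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] != "SEQRES":
            continue
        # Keep only tokens that are standard three-letter residue names;
        # serial numbers, lengths and trailing identifiers are dropped.
        letters.extend(THREE_TO_ONE[t] for t in tokens[1:] if t in THREE_TO_ONE)
    return "".join(letters)


records = [
    "SEQRES 1 114 ASP ILE VAL MET THR GLN SER PRO SER SER LEU SER VAL 2IMM 35",
    "SEQRES 9 114 GLY ALA GLY THR LYS LEU GLU LEU LYS ARG 2IMM 43",
]
sequence = seqres_to_one_letter(records)
print(sequence)
```

The implicit sequence would be read from the residue names of the ATOM records instead; comparing the two strings shows which residues have no coordinates.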
9.0 21
Pitfalls:
Atom name is a mix of chemical element and bond topology: "CA.." ≠ ".CA." (calcium vs. C-alpha).
Sequence number is actually a string: chain and insertion code are required to make it unique (e.g. B 123A).
ATOM 119 CA ARG 18 8.386 51.105 35.847 1.00 7.30 2IMM 179
Fields, left to right: record type, atom number, atom name, amino acid type, sequence number, X, Y, Z, Occ (occupancy), B (temperature factor).
PDB format is strictly column oriented!
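Because the format is column oriented, a record should be sliced by position rather than split on whitespace. The sketch below assumes the standard ATOM column layout (atom name in columns 13-16, coordinates in columns 31-54, and so on). The example record is rebuilt with explicit field widths because the spacing of the record shown above was lost; the function name is an assumption.

```python
def parse_atom_record(line):
    """Slice one ATOM/HETATM record by fixed columns (1-based in comments)."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16],              # columns 13-16, deliberately not stripped
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21].strip(),        # column 22
        "res_seq": line[22:26].strip(),   # columns 23-26, kept as a string
        "i_code": line[26].strip(),       # column 27, insertion code
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66
    }


# Rebuild the example record with the assumed column widths.
example = ("{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
           "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}").format(
    "ATOM", 119, " CA ", "", "ARG", "", "18", "",
    8.386, 51.105, 35.847, 1.00, 7.30)

atom = parse_atom_record(example)
print(repr(atom["name"]), atom["res_seq"], atom["x"], atom["b_factor"])
```

Keeping the atom name unstripped preserves the distinction between " CA " (C-alpha) and "CA  " (calcium), and keeping the sequence number as a string leaves room for the insertion code.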
9.0 22
Hetero Atoms
[...]
HETATM 877 O HOH 1 -4.169 60.050 40.145 1.00 3.00 2IMM 937
[...]
http://xray.bmc.uu.se/hicup/
9.0 23
The crystallographic asymmetric unit does not
necessarily contain a functional molecule
The contents of a crystal lattice unit cell can be generated from the asymmetric unit by applying the required symmetry operations for the crystallographic space-group. But neither is this trivial for the non-crystallographer, nor is it obvious which of the symmetry replicates might make physiological contacts.
1qpi.pdb Tet-repressor/operator complex
9.0 24
... Biological Unit
PQS reasons automatically about how a monomer might be correctly completed to a functional biomolecular complex (and is often correct).
http://pqs.ebi.ac.uk/
9.0 25
NCBI
structure
group
MMDB - very
well integrated
but somewhat
impenetrable.
9.0 26
NDB
http://ndbserver.rutgers.edu/NDB/
urx035.pdb
(Hammerhead Ribozyme)
9.0 27
PDBsum - and "secondary"
structure databases
http://www.biochem.ucl.ac.uk/bsm/pdbsum/
9.0 28
PDBsum - Information
9.0 29
Others
Macromolecular Structure Database at EBI (Relibase, PQS ...)
http://www.ebi.ac.uk/msd/
Macromolecular structure related resources at the PDB

http://www.rcsb.org/pdb/links.html
Structure links at the Southwestern Biotechnology and Informatics Center

http://www.swbic.org/links/1.19.2.5.php
Molecular Models from Chemistry

http://people.ouc.bc.ca/woodcock/molecule/molecule.html
Molecular Library
http://www.nyu.edu/pages/mathmol/library/
.... many, many more.
9.0 30
Concept 4:
Knowledge of
structure allows
mechanistic
explanations.
9.0 31
Structure as an integrated map
- Example questions
• Which part of my structure appears to be conserved ?
• Are two functionally important residues possibly in contact ?
• Where is Asn220 relative to the active site ?
• May the mutation E123A possibly have something to do with
protein stability ?
• Is Leu234 on the surface, or in the core ?
• I want to clone my protein into a yeast two-hybrid system: should I
fuse the DNA binding domain to the N- or the C- terminus ?
9.0 32
Geometric relationships
• Bonds
• Angles, planar and dihedral
• Surfaces
• Chemical potential, amino acid functions
• Static and dynamic disorder
• Structural similarity
• Electrostatics
• Conservation patterns (structural and functional)
• Quaternary structure
• Posttranslational modification sites
• Unexpected homology
• [...]
9.0 33
Distances from
coordinates
XYZ coordinates are vectors in an
orthogonal coordinate system, in Å.
All the rules of analytical geometry apply.
[...]
ATOM 687 OH TYR 86 7.415 62.584 32.900 1.00 3.37
[...]
ATOM 651 O ASP 82 9.996 62.571 32.488 1.00 5.18
[...]
d = [(9.996 - 7.415)^2 + (62.571 - 62.584)^2 + (32.488 - 32.900)^2]^0.5
= [(2.581)^2 + (-0.013)^2 + (-0.412)^2]^0.5
= [6.661561 + 0.000169 + 0.169744]^0.5
= [6.831474]^0.5
= 2.614 Å = 0.2614 nm = 2.614 × 10^-10 m
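The same calculation as a short sketch, using the coordinates of the two atoms above (the variable and function names are illustrative):

```python
import math

# Coordinates in Å, taken from the two ATOM records above.
tyr86_oh = (7.415, 62.584, 32.900)
asp82_o = (9.996, 62.571, 32.488)


def distance(a, b):
    """Euclidean distance between two points in an orthogonal system."""
    return math.sqrt(sum((q - p) ** 2 for p, q in zip(a, b)))


d = distance(tyr86_oh, asp82_o)
print(f"{d:.3f} Å")  # prints 2.614 Å
```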
9.0 34
Dihedral angles
Single bonds:
Freely rotatable, but constrained by steric overlap. Small energetic barrier, preference for staggered conformations.
Double bonds:
Constrained to planar geometry. Large energetic barrier to isomerization.
(Figure: dihedral angle defined by four consecutive atoms i, i+1, i+2, i+3.)
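A dihedral angle can be computed from the coordinates of four consecutive atoms i, i+1, i+2, i+3. The sketch below uses the common atan2 formulation on the three bond vectors; the helper names and the test points are assumptions chosen for illustration. With this formulation, cis is 0°, trans is 180°, and the sign follows the usual right-handed convention.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1 = cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Idealized test geometries around a bond along the z-axis.
p0, p1, p2 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
cis = dihedral(p0, p1, p2, (1.0, 0.0, 1.0))
trans = dihedral(p0, p1, p2, (-1.0, 0.0, 1.0))
quarter = dihedral(p0, p1, p2, (0.0, 1.0, 1.0))
print(cis, abs(trans), quarter)
```

Applied to the backbone atoms C(i-1), N, CA, C of a residue, the same function would yield that residue's φ angle.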
9.0 35
Backbone dihedral angles:
Ramachandran plots



Rotatable bonds in the backbone are named φ, ψ and ω.
Due to steric overlap, not all combinations of (φ, ψ) are allowed.
Allowed and forbidden regions of (φ, ψ) space are shown on the Ramachandran plot.
Observed (φ, ψ) values reflect the theoretical boundaries well.
9.0 36
Sidechain rotamers
3
2

100 randomly chosen
Phe-residues superimposed.
Ponder & Richards (1987) J. Mol. Biol. 193, 775-791

http://dunbrack.fccc.edu/bbdep/
9.0 37
H-bond patterns
Example: TYR - Side Chain Donor
OH can donate a single hydrogen
(The OH-H bond is 1.00Å long and lies in the plane
of CE1, CE2, CZ and OH forming an angle of 110
degrees with the CZ-OH bond.)
Distribution of H-bond counts in all and buried residues, D-A distances, H-A distances and D-H-A angles in Tyr sidechains.
Tyr-Thr sidechain H-bond: despite canonical geometry, correct topology may be ambiguous!
McDonald & Thornton (1994) J. Mol. Biol. 238, 777-793

http://www.biochem.ucl.ac.uk/bsm/atlas/
9.0 38
Molecular surface
Chain "A" of
1AON.PDB -
GroEL/ES complex
Surface rendering
of GroEL/ES
complex
(D. Goodsell)
9.0 39
Molecular surface
Surface provides a visual metaphor,
and a useful tool to map properties.
But how can a molecular surface be defined? Obviously, the hard-sphere
surface is chemically not very relevant.
Van der Waals surface
9.0 40
Molecular surface
Probe! (r = 1.4 Å)
9.0 41
Molecular surface
Contact surface
Accessible surface
"Accessible"

"Buried" Reentrant surface
9.0 42
Calculating solvent accessible
surfaces
1. Draw a sphere around each atom, with a radius of (VdW radius + solvent probe radius).
2. Erase all overlapping sphere surfaces.
3. The remaining area is the accessible surface.
Probe: r = 1.4 Å
VdW radii: C: 1.75 Å, N: 1.55 Å, O: 1.4 Å, H: 1.17 Å
9.0 43
Parameters and assumptions
Problem: Analytical solution inefficient.
Solution: Numerical solution with probe points
Problem: Regular placement of n probe points
Solution: Stochastic placement
Problem: Stochastic placement quite irregular
Solution: Enforce minimum separation
Problem: Efficiency
Solution: Place points only once, translate as needed
Problem: What is a good value for n?
Solution: Try different n, evaluate standard deviation
Problem: Should n be constant per atom, or per area?
Solution: dots/area - need to scale dots with r VdW
Problem: Hydrogens - where to get united atom radii?
Solution: Literature search.
Problem: Reference areas for relative SAA needed
Solution: Model explicitly, as tripeptides
[...]
Sphere point picking: u, v ∈ [0,1]; θ = 2πu; φ = cos^-1(2v - 1)
http://mathworld.wolfram.com/SpherePointPicking.html
Even a straightforward algorithm has its hidden parameters and assumptions.
Results are meaningful only in this context. Any comparison is problematic.
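A minimal sketch of the numerical approach follows, using the radii and the probe size from the previous slide and the sphere point picking formulas given above. It deliberately keeps its own hidden assumptions simple: points are placed once and reused for every atom, the number of points per atom is constant (not scaled by area), and no minimum separation is enforced. All names, and the two-atom test case, are illustrative.

```python
import math
import random

VDW_RADIUS = {"C": 1.75, "N": 1.55, "O": 1.4, "H": 1.17}  # Å, from the slide
PROBE_RADIUS = 1.4                                        # Å


def random_unit_vectors(n, rng):
    """Uniform random points on the unit sphere (sphere point picking)."""
    points = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        theta = 2.0 * math.pi * u          # azimuth
        phi = math.acos(2.0 * v - 1.0)     # polar angle
        points.append((math.sin(phi) * math.cos(theta),
                       math.sin(phi) * math.sin(theta),
                       math.cos(phi)))
    return points


def accessible_areas(atoms, n_points=2000, seed=0):
    """Per-atom accessible surface area in Å^2.

    atoms: list of (element, x, y, z) tuples.
    """
    unit = random_unit_vectors(n_points, random.Random(seed))  # placed once
    areas = []
    for i, (el_i, xi, yi, zi) in enumerate(atoms):
        ri = VDW_RADIUS[el_i] + PROBE_RADIUS
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = xi + ri * ux, yi + ri * uy, zi + ri * uz
            # A point is erased if it lies inside any other expanded sphere.
            buried = any(
                (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2
                < (VDW_RADIUS[el_j] + PROBE_RADIUS) ** 2
                for j, (el_j, xj, yj, zj) in enumerate(atoms) if j != i)
            if not buried:
                exposed += 1
        areas.append(4.0 * math.pi * ri * ri * exposed / n_points)
    return areas


isolated = accessible_areas([("C", 0.0, 0.0, 0.0)])
pair = accessible_areas([("C", 0.0, 0.0, 0.0), ("C", 1.5, 0.0, 0.0)])
print(f"isolated C: {isolated[0]:.1f} Å^2, bonded C: {pair[0]:.1f} Å^2")
```

Changing `n_points` or `seed` changes the result slightly, which is exactly the kind of hidden parameter that makes comparisons between programs problematic.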
9.0 44
Mapping properties on
surfaces
•Properties of atoms (B-factors)
•Ensemble properties of residues
(hydrophobicity, conservation)
•Geometry (local curvature)
•Fields and potentials
(isosurfaces, binding potential)
AChE (1ACL.PDB) color coded by electrostatic potential with GRASP.
(http://trantor.bioc.columbia.edu/grasp/)
9.0 45
Concept 5:
Structure is not
arbitrary, but
contains
recurring units.
9.0 46
Basic building blocks of
structure:
Eg. PROMOTIF - as used in PDBSUM
But: classical descriptions of structural building blocks are as much

based on idealized concepts of geometry as on observations of
nature. An unbiased analysis may arrive at significantly different
classifications !
9.0 47
Unbiased structure motifs:
alignment with added value
Motif alignments ... Why are particular
amino acids conserved? What is
essential in a sequence ?
A structure motif consensus sequence, compiled

from unrelated segments, averages out features of
conservation that are only due to incomplete
divergence (homology).
A consensus sequence, taken from different
structural contexts, averages out features of
sequence that are due to specific functional
(binding, catalysis) or non-local structural
requirements (packing, interaction).
What remains is information about sequence
propensities of local structural elements.
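One common way to quantify such sequence propensities (not necessarily the measure used for the motifs in these slides) is the information content of each alignment column, log2(20) minus the Shannon entropy of the observed residues. The sketch below applies it to a hypothetical set of aligned four-residue segments; the segments and function name are invented for illustration, and no small-sample correction is applied.

```python
import math
from collections import Counter


def column_information(segments, alphabet_size=20):
    """Information content (bits) of each column of equal-length segments."""
    info = []
    for i in range(len(segments[0])):
        counts = Counter(segment[i] for segment in segments)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        info.append(math.log2(alphabet_size) - entropy)
    return info


# Hypothetical aligned segments sharing one local structure.
segments = ["GPDG", "GADG", "GSNG", "GTDG"]
info = column_information(segments)
print([round(bits, 2) for bits in info])
```

Columns with a strong preference (here the invariant first and last positions) score high, while positions that tolerate any residue score low.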
9.0 48
A schematikon motif example:
complex loop
Motif: 1icf 215
Length: 7
Support: 7
Unique: 7
Rank: 399
(Figure: plot with a vertical scale from 0.0 to 3.8 over motif positions 1 to 7.)
9.0 49
A schematikon motif example:
strand N-cap
Motif: 1whi 35
Length: 4
Support: 7
Unique: 7
Rank: 444
(Figure: plot with a vertical scale from 0.0 to 3.8 over motif positions 1 to 4.)
9.0 50
Concept 6:
Domains are
folding units,
functional units, and
units of inheritance.
9.0 51
Domains are ubiquitous in
proteins
Large proteins are composed of compact,
semi-independent units - domains.
Reason:
Modularity
Folding efficiency
2MCP.PDB
9.0 52
Domains in proteins:
Number of
domains in 787
representative
proteins used as
the basis for the
CATH database
Jones S et al. (1998)

Protein Science 7:233
9.0 53
Non-random
relationship
between domain
number and
chain length in
the 787
representative
proteins used as
the basis for the
CATH database

9.0 54
Domain size in
the 787
representative
proteins used as
the basis for the
CATH database

9.0 55
There is no universal
definition of "domains"
Possible definitions are based on independently inherited (sub)sequences
(sequence domain), modular protein functions (functional domain), folding
unit or atomic contacts (structural domain).
Domain: A part of structure that can fold

irrespective of the presence of other
parts of structure
But: what is measured is commonly sequence, function, or structure - NOT FOLDING!
9.0 56
Further complications:
Analogous
structure,
Domain
insertions,
Circular
permutations,
Domain
swapping.
Domain insertion
1A2J.PDB 2TRX.PDB
Protein disulfide isomerase Thioredoxin
9.0 57
Analogous
structure,
Domain
insertions,
Circular
permutations,
Domain
swapping.
Circular permutation
1ERQ.PDB
1ALQ.PDB
beta lactamase
beta lactamase
9.0 58
Analogous
structure,
Domain
insertions,
Circular
permutations,
Domain
swapping.
Domain swapping
11BG.PDB
Bull seminal ribonuclease
9.0 59
Domains can be elusive:
The separation of a structure into domains requires the arbitrary (informed) definition of thresholds in a continuum of possibilities.
9.0 60
Why care ?
Function:
evolution works on sequence, but selects function.
Definition of domains in structure can uncover functional units
that may evolve independently. Sequence searches, alignments
etc. with domains are much more specific.
Once structural domains have been defined, sequence profiles,

HMMs or other computational procedures can be used to pick
out more members of the domain family from the database.
Domains can be defined from sequence patterns, or from the
analysis of structure.
9.0 61
Automated (objective) domain
definition: - Sequence (CDD)
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
CDD
from Smart
and Pfam
CDART
from CDD
and Genbank
9.0 62
Semi-automated consensus domain
definition: - Structure (CATH)
Dihydrolipoamide
dehydrogenase 1LPFA:
Jones S et al. (1998) Domain assignment for protein structures using a consensus
approach: Characterization and analysis. Protein Science 7:233-242
9.0 63
SCOP & CATH: structural classification
The eight
most
frequent
SCOP
Superfolds
http://scop.mrc-lmb.cam.ac.uk/scop/
http://www.biochem.ucl.ac.uk/bsm/cath/
9.0 64
CATH - Class
Class 1: Mainly Alpha
Class 2: Mainly Beta
Class 3: Mixed Alpha/Beta
Class 4: Few Secondary Structures
9.0 65
CATH - Architecture
Roll Super Roll Barrel 2-Layer Sandwich
9.0 66
CATH - Topology
L-fucose Isomerase
Serine Protease
Aconitase, domain 4
TIM Barrel
9.0 67
CATH - Homology
Alanine racemase
Dihydropteroate (DHP) synthetase
FMN dependent fluorescent proteins
7-stranded glycosidases
9.0 68
CATH -
Entry
(Example)
9.0 69
IV: Open Issues
I: Integration into processes, scriptable APIs
II: Sequence based identification of domains
III: Analysing domains in context
IV: Defining modular domain functions
9.0 70
Bioinformaticians apparently
do not like structure !
Sequence:
• Discrete alphabet
• Easy to manipulate
• Well developed datastructures
• Well developed libraries
Structure:
• Continuous space
• Linear algebra, complicated energy functions
• Databases and datastructures are difficult
• Paucity of libraries
Meet the challenge!

9.0 71
