
Bioinformatics and structural genomics

John Ionides
Barton Group Tools
(http://www.compbio.dundee.ac.uk/)

• JalView: multiple sequence alignment analysis
• Jnet: secondary structure prediction for globular proteins
• SCANPS: sequence search against protein or nucleic acid database
• STAMP: comparison and alignment of protein three-dimensional structures

… and many others


[Figure: from protein sequence to secondary structure (α-helix, β-strand) to fold, and on to FUNCTION?]
Topics
• Context
• Archiving structural information
– Data harvesting
• The structure of the archive
– Current status
– Harnessing database technologies
• Integrating NMR: CCPN
• Accessing NMR data
– Validation criteria
EBI: all major databases
[Figure: genome information (megabases) alongside structural information (NMR + XRAY entries in PDB)]
Structural Genomics

Naive aim: Determine the three-dimensional structure of all proteins in a genome.

More realistic aim: Determine the 3D structure of most globular proteins in a genome.

Realistic aim: Determine the 3D structure of the proteins in a genome that are easy to express, purify and crystallise.

7 pilot projects in the USA. Planned projects in the UK and Europe.
Structural Genomics
Target Selection

Cloning Expression and Purification

Structure Determination

Archiving and Annotation

Analysis of the Data


Human Genome -> Human Proteome
• 30 000 to 60 000 protein-coding genes
• The number of different proteins is larger:
– alternative splicing (30 to 50% increase);
– post-translational modifications (5 to 10 fold increase)
• > 100 000 different transcripts and > 1 000 000 different protein molecules in the human body
• There is not one human transcriptome or proteome but many, depending on individual genomic background, developmental stage, tissue and cell type, and environmental influences
Archiving structural data
[Figure: data flow between archives. AutoDep (EBI) and ADIT (RCSB) handle deposition; data are exchanged with BMRB; formats shown: pdb, mmCIF, NMRSTAR]
Secondary structure, disulphide bonds, cis linkages
• No longer requested during deposition.
• Determined automatically during processing
– Can then be changed during the review process
Het Groups

• Want to standardise representations of small molecules within the structural archive
• When a molecule is deposited it will automatically be allocated the correct three-letter code and atom names
Matching HET groups

• Need to take account of missing atoms
[Figure: two depictions of the same sugar ring, one with missing atoms]
• Both should be picked up as α-glucose and labelled as GLC
Technology: Matching
• Graph Matching
A procedure for finding common parts in two or more graphs
Three variants: graph-graph matching, graph-subgraph matching, subgraph-subgraph matching
Technology: History
• Classical Approach
Looking for a maximal clique of the association graph;
Best algorithm: by C. Bron and J. Kerbosch (1973)
Space consumption: O((mn)^2)
Best case complexity: O(mn)
Worst case complexity: O((mn)^n), m ≥ n
Advantage: suitable for both graph-(sub)graph and subgraph-subgraph matching
Disadvantage: prohibitive performance at n, m > 20
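The classical approach can be sketched in a few lines of Python. This is a minimal illustration, not the implementation described in these slides: it builds the association graph of two labelled graphs and enumerates maximal cliques with the basic Bron-Kerbosch recursion (no pivoting). The function names and the toy molecules are assumptions made for the example.

```python
from itertools import product

def association_graph(adj1, adj2, labels1, labels2):
    """Vertices are label-compatible pairs (u, v). Two pairs are joined when
    they are consistent: u1-u2 is an edge exactly when v1-v2 is an edge."""
    nodes = [(u, v) for u, v in product(adj1, adj2) if labels1[u] == labels2[v]]
    nbrs = {p: set() for p in nodes}
    for (u1, v1), (u2, v2) in product(nodes, nodes):
        if u1 != u2 and v1 != v2 and ((u2 in adj1[u1]) == (v2 in adj2[v1])):
            nbrs[(u1, v1)].add((u2, v2))
    return nbrs

def bron_kerbosch(nbrs):
    """Yield every maximal clique (basic recursion, no pivoting)."""
    def expand(r, p, x):
        if not p and not x:
            yield r
            return
        for v in list(p):
            yield from expand(r | {v}, p & nbrs[v], x & nbrs[v])
            p = p - {v}
            x = x | {v}
    yield from expand(frozenset(), set(nbrs), set())

def max_common_subgraph(adj1, adj2, labels1, labels2):
    """Largest clique of the association graph = largest common induced subgraph."""
    nbrs = association_graph(adj1, adj2, labels1, labels2)
    return max(bron_kerbosch(nbrs), key=len, default=frozenset())

# Hypothetical toy input: heavy atoms of a C-C-O chain and a C-O fragment.
chain_adj = {0: {1}, 1: {0, 2}, 2: {1}}
chain_labels = {0: "C", 1: "C", 2: "O"}
frag_adj = {"a": {"b"}, "b": {"a"}}
frag_labels = {"a": "C", "b": "O"}

match = max_common_subgraph(chain_adj, frag_adj, chain_labels, frag_labels)
print(sorted(match))
```

Because the association graph has up to mn vertices, the cost grows quickly with graph size, which is the practical limitation noted above.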
Technology: Ullmann
• Advanced technique
Enumeration scheme with forward checking for dead-end branches;
Best algorithm: by J.R. Ullmann (1976)
Space consumption: O(mn^2)
Best case complexity: O(mn)
Worst case complexity: O(m^n n^2), m ≥ n
Advantage: considerably outperforms the maximal clique algorithm for n, m > 10
Disadvantage: allows only for graph-(sub)graph matching
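The idea of enumeration with forward checking can be illustrated as follows. This is a simplified backtracking sketch in the spirit of Ullmann's scheme, not his original bit-matrix refinement and not the code behind these slides. It finds label- and edge-preserving mappings of a pattern graph into a target graph; all names and the toy graphs are assumptions.

```python
def subgraph_matches(pattern, target, plabel, tlabel):
    """Yield mappings {pattern vertex: target vertex} that preserve labels
    and map every pattern edge onto a target edge."""
    # Try highly connected pattern vertices first.
    order = sorted(pattern, key=lambda u: -len(pattern[u]))
    start = {u: {v for v in target
                 if plabel[u] == tlabel[v] and len(target[v]) >= len(pattern[u])}
             for u in pattern}

    def extend(i, mapping, cands):
        if i == len(order):
            yield dict(mapping)
            return
        u = order[i]
        for v in sorted(cands[u]):
            # Forward checking: prune the candidates of unmapped vertices and
            # abandon this branch as soon as one of them has none left.
            pruned, dead_end = {}, False
            for w in order[i + 1:]:
                c = cands[w] - {v}          # v is now used
                if w in pattern[u]:
                    c &= target[v]          # neighbours must stay neighbours
                if not c:
                    dead_end = True
                    break
                pruned[w] = c
            if not dead_end:
                mapping[u] = v
                yield from extend(i + 1, mapping, pruned)
                del mapping[u]

    yield from extend(0, {}, start)

# Hypothetical toy input: find a C-O fragment inside a C-C-O chain.
pattern = {0: {1}, 1: {0}}
plabel = {0: "C", 1: "O"}
target = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
tlabel = {"a": "C", "b": "C", "c": "O"}

matches = list(subgraph_matches(pattern, target, plabel, tlabel))
print(matches)
```

In the toy case the branch that maps the pattern carbon onto the terminal carbon is cut immediately, because the oxygen is left with no candidate.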
Development
• Subgraph-subgraph matching
The enumeration scheme with forward checking for dead-end branches (Ullmann's algorithm) has been extended to subgraph-subgraph matching.
Space consumption: O(mn^2)
Best case complexity: O(mn^2)
Worst case complexity: O(m^n n^3)
Controllable complexity: the actual performance depends on the required quality of match; for graph-(sub)graph matching it reduces to that of Ullmann's algorithm
Implementation: HetGroups
• Recognition of chemical compounds
Graph representations of all chemical compounds currently held in the Oracle database are compiled into fast-access binary files.
Performance of compilation: 2400 compounds / min

A graph matching utility matches a given compound against the whole data set or against individual precompiled compounds.
Performance of matching: up to 32000 compounds / min

Language: C++ with FORTRAN interface, UNIX

The precompiled binary files also contain easily accessible information on atom and compound names, synonyms, chemical formulas, charges and leaving atoms.
Data harvesting

• Much of the data that would be useful to archive is known by software during the structure determination process
• Each piece of software used during the determination of a structure should contribute to an information pool that can later form the basis of the archive
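One way to picture such an information pool is a shared store to which each program adds the items it already knows, tagged with their origin. The sketch below is purely illustrative: the class, the program names, the category names and the numeric values are all made up for the example and do not describe any existing harvesting format.

```python
import json
from collections import defaultdict

class HarvestPool:
    """Collects key/value items from each program in the pipeline."""

    def __init__(self):
        self._records = defaultdict(dict)

    def contribute(self, program, category, **items):
        # Remember which program supplied each item so it can be traced later.
        for key, value in items.items():
            self._records[category][key] = {"value": value, "source": program}

    def to_json(self):
        return json.dumps(self._records, indent=2, sort_keys=True)

pool = HarvestPool()
# Illustrative values only.
pool.contribute("processing_program", "data_reduction",
                resolution_high=1.9, n_reflections=24310)
pool.contribute("refinement_program", "refinement",
                r_factor=0.19, r_free=0.23)
print(pool.to_json())
```

At deposition time the pooled records could be exported in one step instead of being re-entered by hand.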
XRAY project outline

[Figure: data collection → data processing & reduction → structure solution → model building & refinement → databases]
XRAY data harvesting: ideal
[Figure: data collection → data processing & reduction → structure solution → model building & refinement → databases (relations & indices). Items harvested along the way: molecules, sample details, processing parameters, all other data]
XRAY data harvesting: current
[Figure: data collection → data processing & reduction → structure solution → model building & refinement → databases (relations & indices). Items harvested along the way: molecules, sample details, processing parameters, all other data]
The structure of the archive
[Figure: data flow between archives. AutoDep (EBI) and ADIT (RCSB) handle deposition; data are exchanged with BMRB; formats shown: pdb, mmCIF, NMRSTAR]
Format example: PDB
HEADER SIGNALLING PROTEIN 16-MAR-00 1E0A
TITLE CDC42 COMPLEXED WITH THE GTPASE BINDING DOMAIN OF P21
TITLE 2 ACTIVATED KINASE
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: G25K GTP-BINDING PROTEIN, PLACENTAL ISOFORM (GP),
COMPND 3 CDC42 HOMOLOG;
COMPND 4 CHAIN: A;
COMPND 5 FRAGMENT: 1-184;
COMPND 6 ENGINEERED: YES;
COMPND 7 MUTATION: YES;
COMPND 8 OTHER_DETAILS: COMPLEXED WITH 5'-GUANOSYL-IMIDO-TRIPHOSPHATE;
...
SHEET 1 A1 6 PHE A 110 VAL A 113 0
SHEET 2 A1 6 VAL A 77 PHE A 82 1 O PHE A 78 N LEU A 111
SHEET 3 A1 6 THR A 3 VAL A 9 1 O VAL A 7 N LEU A 79
...
MODEL        1
ATOM      1  N   MET A   1     -12.147  13.950  13.828  1.00  0.00           N
ATOM      2  CA  MET A   1     -10.775  14.199  14.342  1.00  0.00           C
ATOM      3  C   MET A   1      -9.722  13.837  13.300  1.00  0.00           C
ATOM      4  O   MET A   1      -8.711  13.211  13.617  1.00  0.00           O
ATOM      5  CB  MET A   1     -10.571  13.369  15.610  1.00  0.00           C
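PDB records are fixed-column, so an ATOM line is read by slicing character positions, not by splitting on whitespace. The following is a minimal sketch assuming the standard column layout for ATOM records; the `Atom` tuple and `parse_atom` function are names chosen for this example.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str

def parse_atom(line: str) -> Atom:
    """Slice one ATOM record by column (0-based Python slices)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
        element=line[76:78].strip(),
    )

line = "ATOM      1  N   MET A   1     -12.147  13.950  13.828  1.00  0.00           N"
atom = parse_atom(line)
print(atom.name, atom.res_name, atom.chain, atom.x, atom.y, atom.z)
```

Slicing by column keeps working when fields touch each other, for example when large coordinates leave no space between them.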
CleanUp: SOURCE names
$COLI
COLI
E. COLI
E.COLI
ESCHERCHIA COLI
ESCHERICHI $COLI
ESCHERICHIA $ COLI
ESCHERICHIA $COLI
ESCHERICHIA COLI
ESCHERICHIA COLI.
EXCHERICHIA COLI
EXPRESCHERICHIA COLI
CleanUp: Source

Example: SOURCE EXPRESSION_SYSTEM:

In 8396 entries there are 3573 EXPRESSION_SYSTEM records. After all spelling mistakes and input variations have been corrected, these reduce to a final list of just 34 possible systems.
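A clean-up of this kind can be approximated with a small normalisation routine. The sketch below is only an illustration of the idea, not the procedure actually used: it strips punctuation, requires the species token to match exactly, and accepts the genus when it is absent, an abbreviation, or a close misspelling. The canonical list and the 0.8 cutoff are assumptions.

```python
import re
from difflib import SequenceMatcher

# Hypothetical canonical list; the real list of 34 systems is not shown here.
CANONICAL = ["ESCHERICHIA COLI", "SACCHAROMYCES CEREVISIAE", "SPODOPTERA FRUGIPERDA"]

def normalise_source(raw, canonical=CANONICAL, cutoff=0.8):
    """Map a free-text organism name onto a canonical 'GENUS SPECIES' name."""
    tokens = re.sub(r"[^A-Z]+", " ", raw.upper()).split()
    if not tokens:
        return None
    species, genus = tokens[-1], "".join(tokens[:-1])
    best, best_score = None, 0.0
    for name in canonical:
        c_genus, c_species = name.split()
        if species != c_species:
            continue
        if not genus or c_genus.startswith(genus):
            score = 1.0                      # missing or abbreviated genus
        else:
            score = SequenceMatcher(None, genus, c_genus).ratio()
        if score >= cutoff and score > best_score:
            best, best_score = name, score
    return best

variants = ["$COLI", "COLI", "E. COLI", "E.COLI", "ESCHERCHIA COLI",
            "ESCHERICHI $COLI", "ESCHERICHIA $ COLI", "ESCHERICHIA $COLI",
            "ESCHERICHIA COLI", "ESCHERICHIA COLI.", "EXCHERICHIA COLI",
            "EXPRESCHERICHIA COLI"]
cleaned = {normalise_source(v) for v in variants}
print(cleaned)
```

In practice a rule set like this would still need manual review, since a fuzzy match can silently merge organisms that are genuinely different.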
Summary: PDB
• Data archive: http://www.rcsb.org/ + others
• Format definition: HTML and text versions from http://www.rcsb.org/ (FILE FORMATS)
• Toolkits: many (CCP4; http://www.ccp4.ac.uk)
• Contents
– Mainly co-ordinate data. Some experimental and validation data, particularly for crystal structures.
Format example: NMR STAR
save_entry_information
_Saveframe_category entry_information

_Entry_title
;
Structure of Cdc42 bound to the GTPase Binding Domain of PAK
;

loop_
_Author_ordinal
_Author_family_name
_Author_given_name
_Author_middle_initials
_Author_family_title

1 Morreale Angela . .
2 Venkatesan Meenakshi . .

stop_

save_
...
Contents of BMRB
Data type               Protein          DNA          RNA
All chemical shifts     689335 (1947)    6792 (42)    4744 (16)
1H chemical shifts      438977 (1796)    5991 (42)    3064 (16)
15N chemical shifts     67031 (628)      45 (2)       267 (7)
13C chemical shifts     183430 (632)     562 (6)      1367 (7)
Coupling constants      4115 (63)        84 (2)       -
Dipolar couplings       753 (5)          -            -
T1                      1259 (6)         -            -
T2                      1275 (6)         -            -
Heteronuclear NOE       1380 (7)         -            -
Order parameter         765 (6)          -            -


Summary: NMR-STAR
• Data archive: http://www.bmrb.wisc.edu/
• Format definition: dictionary under preparation
• Toolkits
– C, Java: http://www.bmrb.wisc.edu/ (Tools, Software Library)
• Contents
– Mostly chemical shifts. Some couplings, relaxation studies and 1H exchange data.
Format example: mmCIF
_struct.entry_id 1E0A
_struct.title 'CDC42 COMPLEXED WITH THE GTPASE BINDING DOMAIN OF P21 ...'
...
loop_
_entity.id
_entity.type
_entity.src_method
_entity.pdbx_description
_entity.formula_weight
_entity.pdbx_number_of_molecules
_entity.details
1 polymer man
'G25K GTP-BINDING PROTEIN, PLACENTAL ISOFORM (GP), CDC42 HOMOLOG' 20453.834 1
;COMPLEXED WITH 5'-GUANOSYL-IMIDO-TRIPHOSPHATE
;
2 polymer nat 'SERINE/THREONINE-PROTEIN KINASE PAK-ALPHA'
5098.682 1 ?
3 non-polymer syn 'MAGNESIUM ION'
24.305 1 ?
4 non-polymer syn 'PHOSPHOAMINOPHOSPHONIC ACID-GUANYLATE ESTER'
522.198 1 ?
#
Format example: mmCIF (2)
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_auth_atom_name
_atom_site.pdbx_PDB_model_num
ATOM 1 N N ? MET A 1 1 ? -12.147 13.950 13.828 1.00 0.00
? 1 MET A N N 1
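In a `loop_` block the tag list defines the columns, and the values that follow are consumed in order regardless of line breaks. A minimal reader for simple loops might look like the sketch below. It assumes whitespace-separated values with no quoted strings or `;` text fields, and `read_loop` is a name invented for the example. The sample uses a trimmed tag list, not the full `_atom_site` category.

```python
def read_loop(text):
    """Return one dict per row of a simple loop_ block."""
    tags, values = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_" or line.startswith("#"):
            continue
        if line.startswith("_") and not values:
            tags.append(line)              # still in the tag list
        else:
            values.extend(line.split())    # values may wrap across lines
    n = len(tags)
    if n == 0 or len(values) % n:
        raise ValueError("number of values is not a multiple of the tag count")
    return [dict(zip(tags, values[i:i + n])) for i in range(0, len(values), n)]

sample = """
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 N MET -12.147 13.950
13.828
ATOM 2 CA MET -10.775 14.199 14.342
"""
rows = read_loop(sample)
print(rows[0]["_atom_site.label_atom_id"], rows[0]["_atom_site.Cartn_x"])
```

The count check is worth keeping: a row with one value too many or too few shifts every later value into the wrong column.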
Summary: mmCIF
• Data archive: ftp://beta.rcsb.org/pub/pdb/uniformity/data/mmCIF/
• Format definition: model and metamodel
• Toolkits
– C, PERL: http://pdb.rutgers.edu/mmcif
• Contents
– Predominantly the same data as PDB, but richer in some places and internally consistent. Can include X-ray structure factors.
Harnessing database technologies
Relational Database
Interface (SQL)

server

database
Degrees of meta-ness
meta model
(metametadata)

model
(metadata)

data
From abstract model to reality
abstract model

format definition

datafile
Chemistry organisation

Exp. Result Assembly Chains Residues Atoms


Simplified schema for system
Describe as a data model (e.g. UML)

Assembly Chain Residue Atom


Simplified schema for system (2)
Generalised Generalised
Chain Residue

Assembly Chain Residue Atom

Reference Reference
Residue Atom
Normalised database
~ 410 tables, 2000 attributes
• deposition data: ~ 260 tables, 1300 attributes
• reference data: ~ 130 tables, 500 attributes (provides standard values for the deposition data)
• derived data: ~ 20 tables, 200 attributes (calculated from the deposition data)
Pattern of generation
Abstract model

Implementation
Building different databases from the same abstract model
• The database described is highly normalised (each piece of information present only once)
– Very good for deposition, where performance is not critical but integrity is
– Not good for searching, as it takes so long to reconstitute the links
• Therefore build a denormalised ‘data warehouse’

What Is A Data Warehouse
A Data Warehouse is simply a different way of thinking about (and therefore organising) data
• same data, different representation
• not transactional, but for reporting and analysis
• read-only, except for automatic updating
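The contrast between the two designs can be shown with a toy example using Python's built-in `sqlite3` module. The three-table schema below is a made-up miniature, not the archive's schema: the normalised tables store each fact once, and the warehouse table is built from them by a join so that searches need no joins at query time.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Normalised tables: each fact is stored once, linked by keys.
CREATE TABLE chain   (chain_id INTEGER PRIMARY KEY, code TEXT);
CREATE TABLE residue (residue_id INTEGER PRIMARY KEY,
                      chain_id INTEGER REFERENCES chain, seq INTEGER, name TEXT);
CREATE TABLE atom    (atom_id INTEGER PRIMARY KEY,
                      residue_id INTEGER REFERENCES residue,
                      name TEXT, x REAL, y REAL, z REAL);
INSERT INTO chain   VALUES (1, 'A');
INSERT INTO residue VALUES (1, 1, 1, 'MET');
INSERT INTO atom    VALUES (1, 1, 'N',  -12.147, 13.950, 13.828);
INSERT INTO atom    VALUES (2, 1, 'CA', -10.775, 14.199, 14.342);

-- Denormalised warehouse table: the links are resolved once, in advance.
CREATE TABLE atom_wh AS
    SELECT c.code AS chain, r.seq AS res_seq, r.name AS res_name,
           a.name AS atom_name, a.x, a.y, a.z
    FROM atom a
    JOIN residue r USING (residue_id)
    JOIN chain c USING (chain_id);
""")

rows = con.execute(
    "SELECT chain, res_seq, res_name, atom_name FROM atom_wh ORDER BY atom_name"
).fetchall()
print(rows)
```

The warehouse repeats the chain and residue information on every atom row, which is acceptable because it is rebuilt from the normalised tables and never edited directly.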
Integrating NMR: CCPN
CCPN NMR
model harvest system
[Figure: NMR machine (experiment) → data processing → spectrum analysis → structure calculation → databases (relations & indices). Items harvested along the way: molecules, conc., pH, temp., processing parameters, assignment & constraints, parameters & constraints, all other data]
model driven architecture
[Figure: one abstract model generates SQL table definitions (for the database), an XML DTD (for XML), python and C]
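The principle of generating several implementations from one abstract model can be sketched briefly. The model dictionary, type mapping and generator functions below are assumptions made for illustration; they do not reproduce the CCPN model or its generators.

```python
# Hypothetical abstract model: class name -> {attribute: abstract type}.
MODEL = {
    "Chain": {"code": "str"},
    "Residue": {"seq_id": "int", "name": "str"},
    "Atom": {"name": "str", "x": "float", "y": "float", "z": "float"},
}

SQL_TYPES = {"str": "TEXT", "int": "INTEGER", "float": "REAL"}

def to_sql(model):
    """Generate one CREATE TABLE statement per class."""
    statements = []
    for cls, attrs in model.items():
        cols = ", ".join(f"{a} {SQL_TYPES[t]}" for a, t in attrs.items())
        statements.append(f"CREATE TABLE {cls.lower()} ({cols});")
    return "\n".join(statements)

def to_python(model):
    """Generate Python source defining one dataclass per class."""
    lines = ["from dataclasses import dataclass", ""]
    for cls, attrs in model.items():
        lines += ["@dataclass", f"class {cls}:"]
        lines += [f"    {a}: {t}" for a, t in attrs.items()]
        lines.append("")
    return "\n".join(lines)

print(to_sql(MODEL))
print(to_python(MODEL))
```

Because both outputs come from the same description, a change to the model propagates to every implementation without hand editing.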
Current CCPN architecture
[Figure: one abstract model generates SQL table definitions (for the database), an XML DTD (for XML), python and C]
CCPN NMR
test harvest system
[Figure: NMR machine (experiment) → data processing → spectrum analysis → structure calculation → databases (relations & indices). Items harvested along the way: molecules, conc., pH, temp., processing parameters, assignment & constraints, parameters & constraints, all other data]
Accessing NMR data
• NMR structures are not used as widely as they should be
– No figure of merit
– No standardisation of presentation
• handling of representative structures
• handling of ensembles
– Experimental data very hard to reuse
• free format
• already interpreted
• many different methodologies
Chen, Y.W., Dodson, E.J. and Kleywegt, G.J. (2000) Does NMR mean "Not for Molecular Replacement"? Using NMR-based search models to solve protein crystal structures. Structure 8, R213-R220.


PDB_SELECT
Uwe Hobohm and Chris Sander

Non-redundant set of structures (based on sequence similarity).
• 1334 structures
– 332 NMR
– 1002 XRAY
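The idea of a non-redundant set can be illustrated with a greedy filter: walk through the candidates in priority order and keep a structure only if it is not too similar to anything already kept. This is a toy sketch and not the PDB_SELECT procedure itself. The identity measure (ungapped, position by position), the 25% threshold, the names and the sequences are all assumptions.

```python
def identity(a, b):
    """Fraction of identical positions over the shorter sequence (no gaps)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def select_non_redundant(entries, max_identity=0.25):
    """entries: list of (name, sequence) in priority order, best first."""
    kept = []
    for name, seq in entries:
        if all(identity(seq, kept_seq) <= max_identity for _, kept_seq in kept):
            kept.append((name, seq))
    return [name for name, _ in kept]

# Made-up sequences for illustration.
entries = [
    ("seqA", "MKTAYIAKQR"),
    ("seqB", "MKTAYIAKQK"),   # 90% identical to seqA, so it is dropped
    ("seqC", "GSHMLEDPVV"),
]
selected = select_non_redundant(entries)
print(selected)
```

A real selection would use alignment-based similarity and rank candidates by quality, for example resolution for crystal structures.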
Fold Recognition

• Aim to find a protein of known three-dimensional structure that is similar to a protein for which we only know the sequence.
• NMR can generate detailed and useful data about solution structure and dynamics not obtainable by other methods
• In structural genomics, analysis of the data is decoupled from the structure determination process
• Improved archiving is important
Conformation Mobility
also …

Rasmus Fogh
Wayne Boucher
Ernest Laue

TN Bhat
Eldon Ulrich
