Bioinformatics and Structural Genomics: John Ionides

Barton Group Tools (http://www.compbio.dundee.ac.uk/)
• JalView: multiple sequence alignment analysis
• Jnet: secondary structure prediction for globular proteins
• SCANPS: sequence search against protein or nucleic acid database
• STAMP: comparison and alignment of protein three-dimensional structures
… and many others

[Figure: protein sequence → secondary structure (α-helix, β-strand) → fold → function?]
Topics
• Context
• Archiving structural information
– Data harvesting
• The structure of the archive
– Current status
– Harnessing database technologies
• Integrating NMR: CCPN
• Accessing NMR data
– Validation criteria
[Figure: EBI, all major databases. Genome information (megabases) compared with structural information (NMR + XRAY entries in PDB)]
Structural Genomics
Naive aim: Determine the three-dimensional structure of all proteins in a genome.
More realistic aim: Determine the 3D structure of most globular proteins in a genome.
Realistic aim: Determine the 3D structure of those proteins in a genome that are easy to express, purify and crystallise.
7 pilot projects in the USA. Planned projects in the UK and Europe.
Structural Genomics
Target Selection
Cloning Expression and Purification
Structure Determination
Archiving and Annotation
Analysis of the Data

Human Genome -> Human Proteome
• 30 000 to 60 000 protein-coding genes
• The number of different proteins is larger:
– alternative splicing (30 to 50% increase);
– post-translational modifications (5 to 10 fold increase)
• > 100 000 different transcripts and > 1 000 000 different protein molecules in the human body
• There is not one human transcriptome or proteome but many human transcriptomes and proteomes, depending on individual genomic background, developmental stage, tissue and cell type, and environmental influences
Archiving structural data
data flow between archives
[Figure: EBI (AutoDep), RCSB (ADIT) and BMRB, linked by pdb, mmCIF and NMR-STAR files]
Secondary structure, disulphide bonds, cis linkages
• No longer requested during deposition.
• Determined automatically during processing
– Can then be changed during the review process
Het Groups
• Want to standardise representations of small molecules within the structural archive
• When a molecule is deposited it will automatically be allocated the correct three-letter code and atom names
Matching HET groups
• Need to take account of missing atoms
[Figure: two drawings of the same sugar, one with atoms missing; only the oxygen labels (O) survive]
• Both should be picked up as glucose and labelled as GLC
Technology: Matching
• Graph Matching
A procedure for finding common parts in two or more graphs
– Graph-graph matching
– Graph-subgraph matching
– Subgraph-subgraph matching
Technology: History
• Classical Approach
Looking for a maximal clique of an association graph;
Best algorithm: by C. Bron and J. Kerbosch (1973)
Space consumption: O((mn)^2)
Best case complexity: O(mn)
Worst case complexity: O((mn)^n), m ≥ n
Advantage: suitable for both graph-(sub)graph and subgraph-subgraph matching
Disadvantage: prohibitive performance at n,m>20
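The clique approach can be sketched in a few lines. This is only an illustration under stated assumptions, not the archive's C++ code: the function names, the toy fragments and the use of the basic (non-pivoting) Bron-Kerbosch recursion are all choices made for this example. Atoms are paired when their element labels agree, two pairs are joined when they agree on whether a bond is present, and the largest clique of that association graph is a maximum common (induced) subgraph.

```python
from itertools import product


def bron_kerbosch(adj, r=frozenset(), p=None, x=frozenset()):
    """Yield every maximal clique of a graph given as {vertex: set of neighbours}."""
    if p is None:
        p = frozenset(adj)
    if not p and not x:
        yield r  # no candidates left and nothing excluded: r is maximal
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}  # v has been fully explored
        x = x | {v}  # exclude it from later cliques


def association_graph(labels_a, bonds_a, labels_b, bonds_b):
    """Build the association graph of two labelled molecular graphs.

    Vertices are atom pairs (a, b) with equal labels. Two pairs are adjacent
    when they use different atoms on both sides and agree on bond presence.
    """
    nodes = [(a, b) for a, b in product(labels_a, labels_b)
             if labels_a[a] == labels_b[b]]
    adj = {node: set() for node in nodes}
    for (a1, b1), (a2, b2) in product(nodes, repeat=2):
        if a1 != a2 and b1 != b2:
            bonded_a = frozenset((a1, a2)) in bonds_a
            bonded_b = frozenset((b1, b2)) in bonds_b
            if bonded_a == bonded_b:
                adj[(a1, b1)].add((a2, b2))
    return adj


# Toy data (hypothetical): fragment A is C-C-O, fragment B is C-O.
labels_a = {1: "C", 2: "C", 3: "O"}
bonds_a = {frozenset((1, 2)), frozenset((2, 3))}
labels_b = {"x": "C", "y": "O"}
bonds_b = {frozenset(("x", "y"))}

adj = association_graph(labels_a, bonds_a, labels_b, bonds_b)
largest = max(bron_kerbosch(adj), key=len)
```

Here `largest` pairs atom 2 with x and atom 3 with y, the shared C-O unit. The association graph has up to mn vertices, which is where the O((mn)^2) storage comes from.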
Technology: Ullmann
• Advanced technique
Enumeration scheme with forward checking for dead-end branches;
Best algorithm: by J.R. Ullmann (1976)
Space consumption: O(mn^2)
Best case complexity: O(mn)
Worst case complexity: O(m^n n^2), m ≥ n
Advantage: considerably outperforms the maximal clique algorithm for n,m>10
Disadvantage: allows only for graph-(sub)graph matching
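An enumeration scheme with forward checking might look like the sketch below. It is written in the spirit of Ullmann's algorithm but is a simplification: it uses Python sets rather than the original bit-matrix refinement, it accepts extra bonds in the target (a non-induced match), and the function name and toy graphs are assumptions made for illustration. After each tentative assignment the candidate sets of all unassigned pattern atoms are pruned, and the branch is abandoned as soon as one of them becomes empty.

```python
def find_embedding(pattern, target, labels_p, labels_t):
    """Return one mapping {pattern vertex: target vertex}, or None.

    pattern and target are graphs given as {vertex: set of neighbours}.
    """
    # Initial candidates: same label and at least as many neighbours.
    candidates = {
        u: {v for v in target
            if labels_p[u] == labels_t[v] and len(target[v]) >= len(pattern[u])}
        for u in pattern
    }
    # Assign the most constrained pattern vertices first.
    order = sorted(pattern, key=lambda u: len(candidates[u]))

    def extend(mapping, cands, depth):
        if depth == len(order):
            return dict(mapping)
        u = order[depth]
        for v in sorted(cands[u]):
            pruned = {}
            dead_end = False
            # Forward check: prune candidates of every unassigned vertex.
            for w in order[depth + 1:]:
                allowed = cands[w] - {v}       # v is now taken
                if w in pattern[u]:
                    allowed &= target[v]       # neighbours must map to neighbours
                if not allowed:
                    dead_end = True            # some vertex has no candidate left
                    break
                pruned[w] = allowed
            if dead_end:
                continue
            mapping[u] = v
            result = extend(mapping, {**cands, **pruned}, depth + 1)
            if result is not None:
                return result
            del mapping[u]
        return None

    return extend({}, candidates, 0)


# Toy data (hypothetical): find a C-O unit inside C-C-O.
pattern = {"p1": {"p2"}, "p2": {"p1"}}
labels_p = {"p1": "C", "p2": "O"}
target = {"t1": {"t2"}, "t2": {"t1", "t3"}, "t3": {"t2"}}
labels_t = {"t1": "C", "t2": "C", "t3": "O"}

mapping = find_embedding(pattern, target, labels_p, labels_t)
```

Because every pattern vertex must be mapped, this scheme handles graph-(sub)graph matching only, which is the limitation noted on the slide.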
Development
• Subgraph-subgraph matching
The enumeration scheme with forward checking for dead-end branches (Ullmann's algorithm) has been extended to subgraph-subgraph matching.
Space consumption: O(mn^2)
Best case complexity: O(mn^2)
Worst case complexity: O(m^n n^3)
Controllable complexity: the actual performance depends on the required quality of match; for graph-(sub)graph matching it reduces to that of Ullmann's algorithm
Implementation: HetGroups
• Recognition of chemical compounds
Graph representations of all chemical compounds currently held in the Oracle database are compiled into fast-access binary files.
Performance of compilation: 2400 compounds / min
A graph matching utility matches a given compound against the whole data set or against individual precompiled compounds.
Performance of matching: up to 32000 compounds / min
Language: C++ with FORTRAN interface, UNIX
The precompiled binary files also contain easily accessible information on atom and compound names, synonyms, chemical formulas, charges and leaving atoms.
Data harvesting
• Much of the data that would be useful to archive is known by software during the structure determination process
• Each piece of software used during the determination of a structure should contribute to an information pool that can later form the basis of the archive
XRAY project outline
data collection → data processing & reduction → structure solution → model building & refinement → databases
XRAY data harvesting: ideal
[Figure: items harvested along the pipeline: sample details, molecules, processing parameters (two stages), data reduction, relations & indices, all other data]
XRAY data harvesting: current
[Figure: items harvested along the pipeline: sample details, molecules, processing parameters (two stages), data reduction, relations & indices, all other data]
The structure of the archive
data flow between archives
[Figure: EBI (AutoDep), RCSB (ADIT) and BMRB, linked by pdb, mmCIF and NMR-STAR files]
Format example: PDB
HEADER SIGNALLING PROTEIN 16-MAR-00 1E0A
TITLE CDC42 COMPLEXED WITH THE GTPASE BINDING DOMAIN OF P21
TITLE 2 ACTIVATED KINASE
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: G25K GTP-BINDING PROTEIN, PLACENTAL ISOFORM (GP),
COMPND 3 CDC42 HOMOLOG;
COMPND 4 CHAIN: A;
COMPND 5 FRAGMENT: 1-184;
COMPND 6 ENGINEERED: YES;
COMPND 7 MUTATION: YES;
COMPND 8 OTHER_DETAILS: COMPLEXED WITH 5'-GUANOSYL-IMIDO-TRIPHOSPHATE;
...
SHEET 1 A1 6 PHE A 110 VAL A 113 0
SHEET 2 A1 6 VAL A 77 PHE A 82 1 O PHE A 78 N LEU A 111
SHEET 3 A1 6 THR A 3 VAL A 9 1 O VAL A 7 N LEU A 79
...
MODEL 1
ATOM 1 N MET A 1 -12.147 13.950 13.828 1.00 0.00 N
ATOM 2 CA MET A 1 -10.775 14.199 14.342 1.00 0.00 C
ATOM 3 C MET A 1 -9.722 13.837 13.300 1.00 0.00 C
ATOM 4 O MET A 1 -8.711 13.211 13.617 1.00 0.00 O
ATOM 5 CB MET A 1 -10.571 13.369 15.610 1.00 0.00 C
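Reading such a record is a matter of slicing fixed columns. The sketch below assumes the standard ATOM column layout of the PDB format (serial in columns 7-11, atom name in 13-16, co-ordinates in 31-54 and so on); the function name and dictionary keys are invented for this example. The sample record is rebuilt field by field because the listing above has lost its column spacing.

```python
def parse_atom_record(line):
    """Split one fixed-column PDB ATOM/HETATM record into a dictionary."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# First atom of entry 1E0A, laid out in the fixed columns.
record = "".join([
    "ATOM  ",     # columns 1-6: record name
    "    1",      # 7-11: atom serial number
    " ",          # 12: unused
    " N  ",       # 13-16: atom name
    " ",          # 17: alternate location indicator
    "MET",        # 18-20: residue name
    " ",          # 21: unused
    "A",          # 22: chain identifier
    "   1",       # 23-26: residue sequence number
    " ",          # 27: insertion code
    "   ",        # 28-30: unused
    " -12.147",   # 31-38: x
    "  13.950",   # 39-46: y
    "  13.828",   # 47-54: z
    "  1.00",     # 55-60: occupancy
    "  0.00",     # 61-66: temperature factor
    " " * 10,     # 67-76: unused here
    " N",         # 77-78: element symbol
])

atom = parse_atom_record(record)
```

Splitting on whitespace instead would fail on real files, where adjacent fields can run together without a separating space.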
CleanUp: SOURCE names
$COLI
COLI
E. COLI
E.COLI
ESCHERCHIA COLI
ESCHERICHI $COLI
ESCHERICHIA $ COLI
ESCHERICHIA $COLI
ESCHERICHIA COLI
ESCHERICHIA COLI.
EXCHERICHIA COLI
EXPRESCHERICHIA COLI
CleanUp: Source
Example: SOURCE EXPRESSION_SYSTEM:
In 8396 entries there are 3573 EXPRESSION_SYSTEM records; all spelling mistakes and variations of input have been corrected to give a final list of just 34 possible systems.
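One way such a clean-up could be automated is sketched below. This is an assumption-laden illustration, not the procedure actually used on the archive: the controlled vocabulary, the alias table and the 0.8 similarity cutoff are all invented here. Punctuation is stripped first, known abbreviations are resolved through the alias table, and remaining misspellings are matched against the vocabulary with the standard-library `difflib`.

```python
import difflib
import re

# Illustrative controlled vocabulary and abbreviations (hypothetical).
CANONICAL = ["ESCHERICHIA COLI"]
ALIASES = {"COLI": "ESCHERICHIA COLI", "E COLI": "ESCHERICHIA COLI"}


def normalise_source(raw, cutoff=0.8):
    """Map a free-text organism name onto the vocabulary, or return None."""
    # Upper-case, turn anything that is not a letter or space into a space,
    # then collapse runs of whitespace.
    cleaned = " ".join(re.sub(r"[^A-Z ]", " ", raw.upper()).split())
    if cleaned in ALIASES:
        return ALIASES[cleaned]
    match = difflib.get_close_matches(cleaned, CANONICAL, n=1, cutoff=cutoff)
    return match[0] if match else None


variants = [
    "$COLI", "COLI", "E. COLI", "E.COLI", "ESCHERCHIA COLI",
    "ESCHERICHI $COLI", "ESCHERICHIA $ COLI", "ESCHERICHIA $COLI",
    "ESCHERICHIA COLI", "ESCHERICHIA COLI.", "EXCHERICHIA COLI",
    "EXPRESCHERICHIA COLI",
]
cleaned_names = {normalise_source(v) for v in variants}
```

A name that matches nothing returns `None`, so it can be passed to a curator rather than guessed at.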
Summary: PDB
• Data archive: http://www.rcsb.org/ + others
• Format definition: HTML and text versions from http://www.rcsb.org/ (FILE FORMATS)
• Toolkits: many (CCP4; http://www.ccp4.ac.uk)
• Contents
– Mainly co-ordinate data. Some experimental and validation data, particularly for crystal structures.
Format example: NMR STAR
save_entry_information
_Saveframe_category entry_information
_Entry_title
;
Structure of Cdc42 bound to the GTPase Binding Domain of PAK
;
loop_
_Author_ordinal
_Author_family_name
_Author_given_name
_Author_middle_initials
_Author_family_title
1 Morreale Angela . .
2 Venkatesan Meenakshi . .
stop_
save_
...
Contents of BMRB
Data type              Protein          DNA          RNA
All chemical shifts    689335 (1947)    6792 (42)    4744 (16)
1H chemical shifts     438977 (1796)    5991 (42)    3064 (16)
15N chemical shifts    67031 (628)      45 (2)       267 (7)
13C chemical shifts    183430 (632)     562 (6)      1367 (7)
Coupling constants     4115 (63)        84 (2)       -
Dipolar couplings      753 (5)          -            -
T1                     1259 (6)         -            -
T2                     1275 (6)         -            -
Heteronuclear NOE      1380 (7)         -            -
Order parameter        765 (6)          -            -

Summary: NMR-STAR
• Data archive: http://www.bmrb.wisc.edu/
• Format definition: dictionary under preparation
• Toolkits
– C, Java: http://www.bmrb.wisc.edu/ (Tools, Software Library)
• Contents
– Mostly chemical shifts. Some couplings, relaxation studies and 1H exchange data.
Format example: mmCIF
_struct.entry_id 1E0A
_struct.title 'CDC42 COMPLEXED WITH THE GTPASE BINDING DOMAIN OF P21 ...'
...
loop_
_entity.id
_entity.type
_entity.src_method
_entity.pdbx_description
_entity.formula_weight
_entity.pdbx_number_of_molecules
_entity.details
1 polymer man
'G25K GTP-BINDING PROTEIN, PLACENTAL ISOFORM (GP), CDC42 HOMOLOG' 20453.834 1
;COMPLEXED WITH 5'-GUANOSYL-IMIDO-TRIPHOSPHATE
;
2 polymer nat 'SERINE/THREONINE-PROTEIN KINASE PAK-ALPHA'
5098.682 1 ?
3 non-polymer syn 'MAGNESIUM ION'
24.305 1 ?
4 non-polymer syn 'PHOSPHOAMINOPHOSPHONIC ACID-GUANYLATE ESTER'
522.198 1 ?
#
Format example: mmCIF (2)
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_auth_atom_name
_atom_site.pdbx_PDB_model_num
ATOM 1 N N ? MET A 1 1 ? -12.147 13.950 13.828 1.00 0.00
? 1 MET A N N 1
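The loop_ construct pairs a list of tags with a flat stream of values. A minimal reader is sketched below, assuming values contain no quotes, embedded spaces or multi-line text (real mmCIF files need a proper tokenizer). The function name is invented, and the excerpt uses only a subset of the atom_site tags, chosen for illustration, with co-ordinates from the first two atoms of 1E0A.

```python
def parse_simple_loop(text):
    """Parse one mmCIF loop_ block into a list of {tag: value} rows.

    Assumes whitespace-separated values with no quoting.
    """
    tags, values = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_" or line.startswith("#"):
            continue
        if line.startswith("_") and not values:
            tags.append(line)            # still reading the tag list
        else:
            values.extend(line.split())  # data values; rows may wrap
    if not tags or len(values) % len(tags):
        raise ValueError("value count is not a multiple of the tag count")
    width = len(tags)
    return [dict(zip(tags, values[i:i + width]))
            for i in range(0, len(values), width)]


excerpt = """
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 N MET -12.147 13.950 13.828
ATOM 2 CA MET -10.775
14.199 14.342
"""

rows = parse_simple_loop(excerpt)
```

Because rows are cut by tag count rather than by line, a record that wraps across lines, as in the example above, is still read correctly.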
Summary: mmCIF
• Data archive: ftp://beta.rcsb.org/pub/pdb/uniformity/data/mmCIF/
• Format definition: model and metamodel
• Toolkits
– C, PERL: http://pdb.rutgers.edu/mmcif
• Contents
– Predominantly the same data as PDB but richer in some places and internally consistent. Can include X-ray structure factors
Harnessing database technologies
[Figure: relational database: interface (SQL) → server → database]
Degrees of meta-ness
meta model (metametadata) → model (metadata) → data
From abstract model to reality
abstract model
format definition
datafile
Chemistry organisation
Exp. Result Assembly Chains Residues Atoms

Simplified schema for system
Describe as a data model (e.g. UML)
Assembly Chain Residue Atom

Simplified schema for system (2)
Generalised Chain, Generalised Residue
Assembly Chain Residue Atom
Reference Residue, Reference Atom
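As a rough sketch of this schema, the hierarchy can be written as plain data classes. The class and field names below are assumptions made for illustration, the Generalised Chain and Generalised Residue layer is left out, and the methionine atom list is illustrative only. Each observed Residue and Atom points at a shared reference definition, so standard names are stored once and missing atoms can be found by comparison.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ReferenceAtom:
    name: str
    element: str


@dataclass(frozen=True)
class ReferenceResidue:
    code: str                          # three-letter code
    atoms: Tuple[ReferenceAtom, ...]


@dataclass
class Atom:
    reference: ReferenceAtom
    x: float
    y: float
    z: float


@dataclass
class Residue:
    reference: ReferenceResidue
    seq_id: int
    atoms: List[Atom]

    def missing_atoms(self):
        """Names of reference atoms that were not observed."""
        observed = {atom.reference.name for atom in self.atoms}
        return [ref.name for ref in self.reference.atoms
                if ref.name not in observed]


@dataclass
class Chain:
    chain_id: str
    residues: List[Residue]


@dataclass
class Assembly:
    name: str
    chains: List[Chain]


# Illustrative reference definition for methionine (heavy atoms only).
MET = ReferenceResidue("MET", tuple(
    ReferenceAtom(name, element) for name, element in [
        ("N", "N"), ("CA", "C"), ("C", "C"), ("O", "O"),
        ("CB", "C"), ("CG", "C"), ("SD", "S"), ("CE", "C")]))
by_name = {ref.name: ref for ref in MET.atoms}

# The first two atoms of 1E0A, linked to their reference atoms.
residue = Residue(MET, 1, [
    Atom(by_name["N"], -12.147, 13.950, 13.828),
    Atom(by_name["CA"], -10.775, 14.199, 14.342),
])
assembly = Assembly("1E0A", [Chain("A", [residue])])
missing = residue.missing_atoms()
```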
Normalised database: ~410 tables, 2000 attributes
• deposition data: ~260 tables, 1300 attributes
• reference data: ~130 tables, 500 attributes (provides standard values for deposition data)
• derived data: ~20 tables, 200 attributes (deposition data is used to calculate it)
Pattern of generation
Abstract model
Implementation
Building different databases from the same abstract model
• The database described is highly normalised (each piece of information is present only once)
– Very good for deposition, where performance is not critical but integrity is
– Not good for searching, as it takes so long to reconstitute the links
• Therefore build a denormalised ‘data warehouse’

What Is A Data Warehouse
A Data Warehouse is simply a different way of thinking about (and therefore organising) data
• same data, different representation
• not transactional, but for reporting and analysis
• read-only, except for automatic updating
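The contrast between the two representations can be shown with an in-memory SQLite database. The three tiny tables and all column names below are hypothetical stand-ins for the real schema. The normalised tables hold each fact once and need joins to answer a question, while the warehouse table is built from them automatically and answers the same question from a single table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Normalised deposition tables: every fact is stored once.
CREATE TABLE chain   (chain_pk INTEGER PRIMARY KEY, entry_code TEXT, chain_id TEXT);
CREATE TABLE residue (residue_pk INTEGER PRIMARY KEY,
                      chain_fk INTEGER REFERENCES chain(chain_pk),
                      seq_id INTEGER, comp_id TEXT);
CREATE TABLE atom    (atom_pk INTEGER PRIMARY KEY,
                      residue_fk INTEGER REFERENCES residue(residue_pk),
                      name TEXT, x REAL, y REAL, z REAL);

INSERT INTO chain   VALUES (1, '1E0A', 'A');
INSERT INTO residue VALUES (1, 1, 1, 'MET');
INSERT INTO atom    VALUES (1, 1, 'N',  -12.147, 13.950, 13.828);
INSERT INTO atom    VALUES (2, 1, 'CA', -10.775, 14.199, 14.342);

-- Denormalised warehouse table: the joins are paid once, at load time.
CREATE TABLE atom_warehouse AS
    SELECT c.entry_code, c.chain_id, r.seq_id, r.comp_id,
           a.name, a.x, a.y, a.z
    FROM atom a
    JOIN residue r ON r.residue_pk = a.residue_fk
    JOIN chain   c ON c.chain_pk   = r.chain_fk;
CREATE INDEX idx_warehouse ON atom_warehouse (comp_id, name);
""")

# A search now touches a single table.
hits = con.execute(
    "SELECT entry_code, chain_id, seq_id, x FROM atom_warehouse "
    "WHERE comp_id = 'MET' AND name = 'CA'"
).fetchall()
```

The warehouse repeats the entry code and residue name on every atom row, which is exactly the redundancy the normalised form avoids. That is acceptable here because the table is never edited by hand.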
Integrating NMR: CCPN
[Figure: CCPN model and NMR harvest system. Pipeline: NMR machine / experiment → data processing → spectrum analysis → structure calculation → databases. Harvested items: molecules, conc., pH, temp., processing parameters, assignment & constraints, parameters & constraints, relations & indices, all other data]
From abstract model to reality
abstract model
format definition
datafile
model driven architecture
[Figure: abstract model → SQL table definitions, XML DTD, python, C; data held in a database or as XML]
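The idea of generating several implementations from one abstract model can be illustrated with a toy generator. Everything here is an assumption made for the sketch: the dictionary-based model, the type names and both generator functions are invented, and the real CCPN model is far richer. A single description of classes and attributes is turned into SQL table definitions and into Python source, so the two cannot drift apart.

```python
# Toy abstract model (hypothetical): class name -> {attribute: type}.
MODEL = {
    "Residue": {"seq_id": "int", "code": "str"},
    "Atom": {"name": "str", "x": "float", "y": "float", "z": "float"},
}

SQL_TYPES = {"int": "INTEGER", "float": "REAL", "str": "TEXT"}


def to_sql(model):
    """Generate one CREATE TABLE statement per class in the model."""
    statements = []
    for cls, attrs in model.items():
        columns = ", ".join(f"{name} {SQL_TYPES[typ]}"
                            for name, typ in attrs.items())
        statements.append(f"CREATE TABLE {cls} ({columns});")
    return "\n".join(statements)


def to_python(model):
    """Generate Python source defining one data class per model class."""
    lines = ["from dataclasses import dataclass", ""]
    for cls, attrs in model.items():
        lines.append("@dataclass")
        lines.append(f"class {cls}:")
        lines.extend(f"    {name}: {typ}" for name, typ in attrs.items())
        lines.append("")
    return "\n".join(lines)


sql_source = to_sql(MODEL)
python_source = to_python(MODEL)

# The generated Python can be executed to obtain usable classes.
namespace = {}
exec(python_source, namespace)
ca = namespace["Atom"](name="CA", x=-10.775, y=14.199, z=14.342)
```

Adding an attribute to `MODEL` changes both outputs at once, which is the main attraction of a model-driven design.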
Current CCPN architecture
[Figure: abstract model → SQL table definitions, XML DTD, python, C; data held in a database or as XML]
[Figure: CCPN test and NMR harvest system. Pipeline: NMR machine / experiment → data processing → spectrum analysis → structure calculation → databases. Harvested items: molecules, conc., pH, temp., processing parameters, assignment & constraints, parameters & constraints, relations & indices, all other data]
Structural Genomics
Target Selection
Cloning Expression and Purification
Structure Determination
Archiving and Annotation
Analysis of the Data

Accessing NMR data
• NMR structures are not used as widely as
they should be
– No figure of merit
– No standardisation of presentation
• handling of representative structures
• handling of ensembles
– Experimental data very hard to reuse
• free format
• already interpreted
• many different methodologies
Does NMR mean "Not for Molecular Replacement" ? Using
NMR-based search models to solve protein crystal
structures.
Chen, Y.W., Dodson, E.J. and Kleywegt, G.J.
Structure 8, R213-R220 (2000)

PDB_SELECT
Uwe Hobohm and Chris Sander
Non-redundant set of structures (based on sequence similarity).
• 1334 structures
• 332 NMR
• 1002 XRAY
Fold Recognition
• Aim to find a protein of known three-dimensional structure that is similar to a protein for which we know only the sequence.
• NMR can generate detailed and useful data about solution structure and dynamics not obtainable by other methods
• During structural genomics, analysis of the data is decoupled from the structure determination process
• Improved archiving is important
Conformation Mobility
also …
Rasmus Fogh
Wayne Boucher
Ernest Laue
TN Bhat
Eldon Ulrich
