and
structural genomics
John Ionides
Barton Group Tools
(http://www.compbio.dundee.ac.uk/)
α-helix
β-strand
Secondary Structure
Fold
Topics
• Context
• Archiving structural information
– Data harvesting
• The structure of the archive
– Current status
– Harnessing database technologies
• Integrating NMR: CCPN
• Accessing NMR data
– Validation criteria
EBI- all major databases
Megabases
Genome information
NMR + XRAY
entries in PDB
Structural information
Structural Genomics
Structure Determination
• > 100 000 different transcripts and > 1 000 000 different protein
molecules in the human body
EBI
pdb,
mmCIF NMRSTAR
BMRB
pdb RCSB
ADIT
Secondary structure,
disulphide bonds, cis linkages
• Both should be picked up as α-glucose and
labelled as GLC
Technology: Matching
• Graph Matching
A procedure for finding common parts in two or
more graphs
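A minimal sketch of the idea, assuming the simplest case: one small labelled graph (a fragment) is searched for inside a larger one by backtracking. The function name, the node names and the toy graphs are illustrative only; this is not necessarily the procedure used in the archive.

```python
def subgraph_match(p_labels, p_edges, t_labels, t_edges):
    """Return one mapping pattern-node -> target-node, or None if no match.

    Labels are dicts node -> label; edges are iterables of node pairs.
    Every pattern edge must be present in the target (extra target edges
    are allowed).
    """
    p_nodes = list(p_labels)
    p_adj = {frozenset(e) for e in p_edges}
    t_adj = {frozenset(e) for e in t_edges}

    def extend(mapping):
        if len(mapping) == len(p_nodes):
            return dict(mapping)
        node = p_nodes[len(mapping)]
        for cand in t_labels:
            if cand in mapping.values() or t_labels[cand] != p_labels[node]:
                continue
            # Each pattern edge to an already-mapped node must exist in the target.
            ok = all(frozenset((cand, mapping[m])) in t_adj
                     for m in mapping if frozenset((node, m)) in p_adj)
            if ok:
                mapping[node] = cand
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[node]  # backtrack
        return None

    return extend({})


# Toy example: find a C-O-C fragment in a five-atom chain C-C-O-C-O.
pattern_labels = {"a": "C", "b": "O", "c": "C"}
pattern_edges = [("a", "b"), ("b", "c")]
target_labels = {1: "C", 2: "C", 3: "O", 4: "C", 5: "O"}
target_edges = [(1, 2), (2, 3), (3, 4), (4, 5)]

match = subgraph_match(pattern_labels, pattern_edges, target_labels, target_edges)
```

Finding the largest common part of two arbitrary graphs is a harder search; the sketch only covers the case where one graph is wholly contained in the other.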
data collection → data processing & reduction → structure solution → model building & refinement → databases
XRAY data harvesting: ideal
relations & indices
data collection → data processing & reduction → structure solution → model building & refinement → databases
Format example: PDB
HEADER SIGNALLING PROTEIN 16-MAR-00 1E0A
TITLE CDC42 COMPLEXED WITH THE GTPASE BINDING DOMAIN OF P21
TITLE 2 ACTIVATED KINASE
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: G25K GTP-BINDING PROTEIN, PLACENTAL ISOFORM (GP),
COMPND 3 CDC42 HOMOLOG;
COMPND 4 CHAIN: A;
COMPND 5 FRAGMENT: 1-184;
COMPND 6 ENGINEERED: YES;
COMPND 7 MUTATION: YES;
COMPND 8 OTHER_DETAILS: COMPLEXED WITH 5'-GUANOSYL-IMIDO-TRIPHOSPHATE;
...
SHEET 1 A1 6 PHE A 110 VAL A 113 0
SHEET 2 A1 6 VAL A 77 PHE A 82 1 O PHE A 78 N LEU A 111
SHEET 3 A1 6 THR A 3 VAL A 9 1 O VAL A 7 N LEU A 79
...
MODEL 1
ATOM      1  N   MET A   1     -12.147  13.950  13.828  1.00  0.00           N
ATOM      2  CA  MET A   1     -10.775  14.199  14.342  1.00  0.00           C
ATOM      3  C   MET A   1      -9.722  13.837  13.300  1.00  0.00           C
ATOM      4  O   MET A   1      -8.711  13.211  13.617  1.00  0.00           O
ATOM      5  CB  MET A   1     -10.571  13.369  15.610  1.00  0.00           C
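PDB ATOM records are fixed-width, so they are read by column position rather than by splitting on spaces. A minimal sketch, assuming the standard ATOM column layout; the function name is illustrative and the sample line is the first ATOM record of 1E0A padded to its fixed columns.

```python
def parse_atom(line):
    """Slice one fixed-width PDB ATOM record into named fields.

    1-based columns: 7-11 serial, 13-16 atom name, 18-20 residue name,
    22 chain, 23-26 residue number, 31-54 x/y/z, 55-60 occupancy,
    61-66 B-factor, 77-78 element.
    """
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


LINE = "ATOM      1  N   MET A   1     -12.147  13.950  13.828  1.00  0.00           N"
atom = parse_atom(LINE)
```

Splitting on whitespace would work for this record but fails in general, because adjacent fields in a PDB file may touch (for example large negative coordinates).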
CleanUp: SOURCE names
$COLI
COLI
E. COLI
E.COLI
ESCHERCHIA COLI
ESCHERICHI $COLI
ESCHERICHIA $ COLI
ESCHERICHIA $COLI
ESCHERICHIA COLI
ESCHERICHIA COLI.
EXCHERICHIA COLI
EXPRESCHERICHIA COLI
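One way such variants could be collapsed onto a single name is to strip punctuation, look up known abbreviations, and fall back to approximate string matching. A minimal sketch using Python's standard difflib; the alias table, the 0.8 cutoff and the function name are assumptions made for illustration, not the clean-up procedure actually used.

```python
import difflib
import re

CANONICAL = ["ESCHERICHIA COLI"]
# Abbreviations too short for fuzzy matching (assumed alias table).
ALIASES = {"COLI": "ESCHERICHIA COLI", "E COLI": "ESCHERICHIA COLI"}

VARIANTS = [
    "$COLI", "COLI", "E. COLI", "E.COLI", "ESCHERCHIA COLI",
    "ESCHERICHI $COLI", "ESCHERICHIA $ COLI", "ESCHERICHIA $COLI",
    "ESCHERICHIA COLI", "ESCHERICHIA COLI.", "EXCHERICHIA COLI",
    "EXPRESCHERICHIA COLI",
]


def clean_source(raw, cutoff=0.8):
    """Map a raw SOURCE string to a canonical name, or None if unrecognised."""
    # Replace anything that is not a letter with a space, then collapse spaces.
    key = " ".join(re.sub(r"[^A-Za-z]+", " ", raw).upper().split())
    if key in ALIASES:
        return ALIASES[key]
    hits = difflib.get_close_matches(key, CANONICAL, n=1, cutoff=cutoff)
    return hits[0] if hits else None


cleaned = {v: clean_source(v) for v in VARIANTS}
```

With these settings every variant listed above maps to ESCHERICHIA COLI, while an unrelated name such as HOMO SAPIENS is left unmatched (None) for manual review.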
CleanUp: Source
• Toolkits
– Many (e.g. CCP4; http://www.ccp4.ac.uk)
• Contents
– Mainly co-ordinate data. Some experimental and validation data,
particularly for crystal structures.
Format example: NMR STAR
save_entry_information
_Saveframe_category entry_information
_Entry_title
;
Structure of Cdc42 bound to the GTPase Binding Domain of PAK
;
loop_
_Author_ordinal
_Author_family_name
_Author_given_name
_Author_middle_initials
_Author_family_title
1 Morreale Angela . .
2 Venkatesan Meenakshi . .
stop_
save_
...
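The loop_ construct is a table: tag names first, then whitespace-separated rows, closed by stop_. A minimal sketch of reading the author loop above, assuming one row per line and no quoted or multi-line values; the function name is illustrative and this is not the BMRB toolkit API.

```python
STAR_TEXT = """\
loop_
_Author_ordinal
_Author_family_name
_Author_given_name
_Author_middle_initials
_Author_family_title
1 Morreale Angela . .
2 Venkatesan Meenakshi . .
stop_
"""


def parse_star_loop(text):
    """Return the rows of the first loop_ ... stop_ block as dicts."""
    tags, rows, in_loop = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if line == "loop_":
            in_loop = True
        elif line == "stop_":
            break
        elif in_loop and line.startswith("_"):
            tags.append(line)          # tag names come first
        elif in_loop and line:
            rows.append(dict(zip(tags, line.split())))
    return rows


authors = parse_star_loop(STAR_TEXT)
```

The "." values are kept as-is here; in this example they stand for the empty middle-initial and title fields.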
Contents of BMRB
Data type             Protein          DNA         RNA
All Chemical Shifts   689335 (1947)    6792 (42)   4744 (16)
1H Chemical Shifts    438977 (1796)    5991 (42)   3064 (16)
15N Chemical Shifts   67031 (628)      45 (2)      267 (7)
13C Chemical Shifts   183430 (632)     562 (6)     1367 (7)
T1                    1259 (6)         -           -
T2                    1275 (6)         -           -
• Toolkits
– C, Java: http://www.bmrb.wisc.edu/ (Tools, Software Library)
• Contents
– Mostly chemical shifts. Some couplings, relaxation studies and 1H
exchange data.
Format example: mmCIF
_struct.entry_id 1E0A
_struct.title 'CDC42 COMPLEXED WITH THE GTPASE BINDING DOMAIN OF P21 ...'
...
loop_
_entity.id
_entity.type
_entity.src_method
_entity.pdbx_description
_entity.formula_weight
_entity.pdbx_number_of_molecules
_entity.details
1 polymer man
'G25K GTP-BINDING PROTEIN, PLACENTAL ISOFORM (GP), CDC42 HOMOLOG' 20453.834 1
;COMPLEXED WITH 5'-GUANOSYL-IMIDO-TRIPHOSPHATE
;
2 polymer nat 'SERINE/THREONINE-PROTEIN KINASE PAK-ALPHA'
5098.682 1 ?
3 non-polymer syn 'MAGNESIUM ION'
24.305 1 ?
4 non-polymer syn 'PHOSPHOAMINOPHOSPHONIC ACID-GUANYLATE ESTER'
522.198 1 ?
#
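In an mmCIF loop a row may wrap over several lines, so values are assigned to tags by count rather than by line. A minimal sketch using rows 2 and 3 of the entity loop above; it borrows Python's shlex for the quoted strings, which is only an approximation of the mmCIF quoting rules (a value such as 5'-GUANOSYL or a ";" text block would need a real mmCIF parser). The function name is illustrative.

```python
import shlex

TAGS = ["id", "type", "src_method", "pdbx_description",
        "formula_weight", "pdbx_number_of_molecules", "details"]

LOOP_BODY = """\
2 polymer nat 'SERINE/THREONINE-PROTEIN KINASE PAK-ALPHA'
5098.682 1 ?
3 non-polymer syn 'MAGNESIUM ION'
24.305 1 ?
"""


def read_rows(body, tags):
    """Group whitespace/quote-delimited tokens into one dict per row."""
    tokens = shlex.split(body)  # newlines count as ordinary whitespace
    if len(tokens) % len(tags):
        raise ValueError("token count is not a multiple of the tag count")
    rows = []
    for i in range(0, len(tokens), len(tags)):
        # "?" (unknown) and "." (not applicable) become None.
        values = [None if t in ("?", ".") else t for t in tokens[i:i + len(tags)]]
        rows.append(dict(zip(tags, values)))
    return rows


entities = read_rows(LOOP_BODY, TAGS)
```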
Format example: mmCIF (2)
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_auth_atom_name
_atom_site.pdbx_PDB_model_num
ATOM 1 N N ? MET A 1 1 ? -12.147 13.950 13.828 1.00 0.00
? 1 MET A N N 1
Summary: mmCIF
• Data archive
ftp://beta.rcsb.org/pub/pdb/uniformity/data/mmCIF/
• Toolkits
– C, PERL http://pdb.rutgers.edu/mmcif
• Contents
– Predominantly the same data as PDB but richer in some
places and internally consistent. Can include X-ray structure
factors
Harnessing database technologies
Relational Database
Interface (SQL)
server
database
Degrees of meta-ness
meta model
(metametadata)
model
(metadata)
data
From abstract model to reality
abstract model
format definition
datafile
Chemistry organisation
Reference Reference
Residue Atom
Normalised database
~ 410 tables, 2000 attributes
deposition data
~ 260 tables, 1300 attributes
Implementation
Building different databases from
the same abstract model
• Database described is highly normalised (each
piece of information present only once)
– Very good for deposition where performance is not
critical but integrity is
– Not good for searching as it takes so long to
reconstitute the links
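The trade-off can be sketched with an in-memory SQLite database: the normalised tables store each organism name once and need a join to answer a search, whereas a flat table derived from them can be searched directly. Table names, column names and the two entries are hypothetical, chosen only for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Normalised: each organism name is stored exactly once.
    CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE entry (code TEXT PRIMARY KEY, title TEXT,
                        organism_id INTEGER REFERENCES organism(id));
    INSERT INTO organism VALUES (1, 'ESCHERICHIA COLI');
    INSERT INTO entry VALUES ('XXX1', 'hypothetical entry 1', 1);
    INSERT INTO entry VALUES ('XXX2', 'hypothetical entry 2', 1);
""")

# Searching the normalised form has to reconstitute the link with a join.
JOINED = """SELECT e.code, o.name FROM entry AS e
            JOIN organism AS o ON o.id = e.organism_id
            WHERE o.name = ? ORDER BY e.code"""
normalised_hits = con.execute(JOINED, ("ESCHERICHIA COLI",)).fetchall()

# A denormalised search table is built once from the same data ...
con.execute("""CREATE TABLE entry_search AS
               SELECT e.code, e.title, o.name AS organism
               FROM entry AS e JOIN organism AS o ON o.id = e.organism_id""")

# ... and is then queried without any join.
flat_hits = con.execute(
    "SELECT code, organism FROM entry_search WHERE organism = ? ORDER BY code",
    ("ESCHERICHIA COLI",)).fetchall()
```

Both queries return the same rows; the flat table repeats the organism name in every row, which is what makes it unsuitable as the deposition store.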
format definition
datafile
model driven architecture
abstract model
SQL table
XML DTD python C
definitions
XML
database
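The point of a model driven architecture is that every concrete form is generated from one abstract description. A minimal sketch, assuming a toy model with a single class; the attribute names, type names and generator functions are hypothetical and do not reproduce the CCPN code generators.

```python
# Toy abstract model: class name -> list of (attribute, abstract type).
MODEL = {"Residue": [("code", "string"), ("seq_id", "int")]}

SQL_TYPES = {"string": "TEXT", "int": "INTEGER"}


def to_sql(model):
    """Emit one CREATE TABLE statement per class in the model."""
    stmts = []
    for cls, attrs in model.items():
        cols = ", ".join(f"{name} {SQL_TYPES[kind]}" for name, kind in attrs)
        stmts.append(f"CREATE TABLE {cls} ({cols});")
    return "\n".join(stmts)


def to_dtd(model):
    """Emit an XML DTD in which each class is an empty element with attributes."""
    lines = []
    for cls, attrs in model.items():
        lines.append(f"<!ELEMENT {cls} EMPTY>")
        decls = " ".join(f"{name} CDATA #REQUIRED" for name, _ in attrs)
        lines.append(f"<!ATTLIST {cls} {decls}>")
    return "\n".join(lines)


sql_text = to_sql(MODEL)
dtd_text = to_dtd(MODEL)
```

Adding an attribute to MODEL changes both outputs together, which is how the generated definitions are kept consistent with each other.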
Current CCPN architecture
abstract model
SQL table
XML DTD python C
definitions
XML
database
CCPN NMR
test harvest system
molecules
all other
conc.
data
pH
Structure Determination
• 1334 structures
• 332 NMR
• 1002 XRAY
Fold Recognition
Rasmus Fogh
Wayne Boucher
Ernest Laue
TN Bhat
Eldon Ulrich