Module 1
WHAT is a database?
A collection of data that needs to be:
• Structured
• Searchable
• Updated (periodically)
• Cross referenced
Challenge:
To turn “meaningless” data into useful information that can be
accessed and analysed in the best way possible.
For example:
HOW would YOU organise all biological sequences so that the
biological information is optimally accessible?
[Figure labels: secondary structure; post-translational protein modification (PTM); protein-protein interaction; curated biological data; 3D structures; folds]
Data Domains
• Types of data generated by molecular biology
research:
– Nucleotide sequences (DNA and mRNA)
– Protein sequences
– 3-D protein structures
– Complete genomes and maps
• Specialized
– Contain data from individual organisms
– Specific categories/functions of sequences
– Data generated by specific sequencing technologies
Types of Databases - By Level of Curation
• Archival data
– Repository of information
– Redundant
– Submitters maintain editorial control over their records
– No controlled vocabulary
– Variation in annotation of biological features
• Curated data
– Non-redundant
– Each record is intended to present an encapsulation of the current
understanding of a gene or protein
– Records contain value-added information that has been added by one or
more experts
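The difference between a redundant archive and a non-redundant collection can be illustrated with a small sketch. It assumes the simplest possible notion of redundancy (records whose sequences are identical), and the accessions and sequences are hypothetical toy values, not real records. Curated resources rely on expert review as well, so this shows only the mechanical part of the idea.

```python
from collections import defaultdict


def collapse_redundant(records):
    """Group (accession, sequence) pairs that share an identical sequence."""
    by_sequence = defaultdict(list)
    for accession, sequence in records:
        # Normalise case so 'atg' and 'ATG' count as the same sequence.
        by_sequence[sequence.upper()].append(accession)
    return dict(by_sequence)


# Hypothetical archival submissions: two submitters deposited the same sequence.
archival = [
    ("TOY00001", "ATGGCCAAG"),
    ("TOY00002", "atggccaag"),
    ("TOY00003", "ATGGCCAAA"),
]

non_redundant = collapse_redundant(archival)
for sequence, accessions in non_redundant.items():
    print(sequence, "<-", ", ".join(accessions))
```

Three archival records collapse into two non-redundant entries, each remembering which submissions it came from.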
Different Database Types
Primary databases
Secondary databases
Composite databases
Primary Databases
Contain sequence data such as nucleic acid or protein
TrEMBL
Created in 1996
Computer-annotated supplement to SWISS-PROT
Translations of coding sequences (CDS) in the EMBL database,
maintained at the EBI, UK
TrEMBL contains everything that is not yet in SWISS-PROT
Two sections:
– SP-TrEMBL
– REM-TrEMBL
PIR
PIR stands for Protein Information Resource
GenBank entry: fields
ACCESSION Line:
– Unique identifier for a sequence record
– Usually a combination of one or more letters and numbers (e.g., U12345 or
AF123456)
VERSION Line:
Syntax: VERSION U49845.1 GI:1293613
VERSION: Unique identifier in the form accession.version. Any change to the sequence data (even a single base)
increases the version number, e.g., U12345.1 → U12345.2
GI: GenInfo Identifier, a sequence identification number, here for the nucleotide sequence. A separate GI is
also assigned to each protein translation within a nucleotide sequence record
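The VERSION syntax shown above can be taken apart with a short regular expression. This is a minimal sketch, not an official parser: the function name `parse_version_line` is made up for illustration, the accession pattern (letters, an optional underscore, digits) is an assumption, and the GI part is treated as optional.

```python
import re

# Assumed layout: "VERSION <accession>.<version> [GI:<number>]"
VERSION_LINE = re.compile(
    r"^VERSION\s+(?P<accession>[A-Z]+_?\d+)\.(?P<version>\d+)"
    r"(?:\s+GI:(?P<gi>\d+))?\s*$"
)


def parse_version_line(line):
    """Split a VERSION line into accession, version number and optional GI."""
    match = VERSION_LINE.match(line)
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    gi = match.group("gi")
    return {
        "accession": match.group("accession"),
        "version": int(match.group("version")),
        "gi": int(gi) if gi is not None else None,
    }


record = parse_version_line("VERSION U49845.1 GI:1293613")
print(record)
```

Comparing the `version` values of two lines with the same accession is then enough to tell whether the sequence data changed between them.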
KEYWORDS Line: Word or phrase describing the sequence
SOURCE Line: abbreviated form of the organism name
Organism name
REFERENCES
Author
Journal
Title
Pubmed
FEATURES
Information about genes and gene products, as well as regions of biological
significance reported in the sequence. These can include regions of the sequence that
code for proteins and RNA molecules, as well as a number of other features
- Source
- Gene
- CDS
ORIGIN
The sequence data begin on the line immediately below ORIGIN. To view/save the
sequence data only, display the record in FASTA format
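Pulling the sequence out of the ORIGIN block by hand is also straightforward. The sketch below assumes the usual flat-file layout (a position number followed by blocks of bases, with `//` closing the record). The helper names and the identifier `toy_fragment` are hypothetical, and the 120 bases are simply the first two sequence lines of the erythropoietin example in these notes, re-laid out for illustration.

```python
def origin_to_sequence(record_text):
    """Return the bases that follow the ORIGIN line, without numbers or spaces."""
    parts = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Keep letters only: drops the position number and the spacing.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(parts)


def to_fasta(identifier, sequence, width=60):
    """Format a sequence as FASTA: a '>' header line, then fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + identifier + "\n" + "\n".join(lines)


# Toy record fragment (illustrative layout, not a complete entry).
toy_record = "\n".join([
    "ORIGIN",
    "        1 agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag",
    "       61 tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat",
    "//",
])

sequence = origin_to_sequence(toy_record)
fasta = to_fasta("toy_fragment", sequence)
print(fasta)
```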
EMBL entry: example
ID HSERPG standard; DNA; HUM; 3398 BP.
XX
AC X02158;
XX
SV X02158.1
XX
DT 13-JUN-1985 (Rel. 06, Created)
DT 22-JUN-1993 (Rel. 36, Last updated, Version 2)
XX
DE Human gene for erythropoietin
XX
KW erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS Homo sapiens (human)
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN [1]
RP 1-3398
RX MEDLINE; 85137899.
RA Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F., Kawakita M.,
RA Shimizu T., Miyake T.;
RT Isolation and characterization of genomic and cDNA clones of human
RT erythropoietin;
RL Nature 313:806-810(1985).
XX
DR GDB; 119110; EPO.
DR GDB; 119615; TIMP1.
DR SWISS-PROT; P01588; EPO_HUMAN.
XX
…
FT /organism=Homo sapiens
FT mRNA join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
FT CDS join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
FT /db_xref=SWISS-PROT:P01588
FT /product=erythropoietin
FT /protein_id=CAA26095.1
FT /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
FT AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
FT QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
FT TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
FT mat_peptide join(1262..1339,1596..1682,2294..2473,2608..2763)
FT /product=erythropoietin
FT sig_peptide join(615..627,1194..1261)
FT exon 397..627
FT /number=1
FT intron 628..1193
FT /number=1
FT exon 1194..1339
FT /number=2
FT intron 1340..1595
FT /number=2
FT exon 1596..1682
FT /number=3
FT intron 1683..2293
FT /number=3
FT exon 2294..2473
FT /number=4
FT intron 2474..2607
FT /number=4
FT exon 2608..3327
FT /note=3' untranslated region
FT /number=5
XX
SQ Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag 60
tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat 120
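The `join(...)` locations in the FT lines of this entry can be checked against each other with a few lines of code. The sketch below reads only simple `start..end` ranges, so operators such as `complement(...)`, single-base positions and partial markers (`<`, `>`) are deliberately not handled. The function names are made up for illustration; the locations and the translation are copied from the entry above.

```python
import re


def parse_join(location):
    """Extract (start, end) pairs from a location such as 'join(1..5,9..12)'."""
    return [(int(s), int(e)) for s, e in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(segments):
    """Total number of bases covered by the segments (coordinates are inclusive)."""
    return sum(end - start + 1 for start, end in segments)


def introns_between(segments):
    """Gaps between consecutive segments, i.e. the implied intron coordinates."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(segments, segments[1:])
    ]


cds = parse_join("join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)")
mrna = parse_join("join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)")

translation = (
    "MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE"
    "AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG"
    "QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD"
    "TFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
)

cds_length = spliced_length(cds)
print("CDS length:", cds_length)
print("Codons (including stop):", cds_length // 3)
print("Residues in /translation:", len(translation))
print("Introns implied by mRNA:", introns_between(mrna))
```

The spliced CDS covers three bases per residue of the translation plus one stop codon, and the gaps between the mRNA segments coincide with the intron features listed in the entry.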
Secondary Databases
Sometimes known as pattern databases
Contain results from the analysis of the sequences in the
primary databases
Examples of secondary databases include:
• PROSITE
• Pfam
• BLOCKS
• PRINTS
• IDENTIFY
Single motif methods
– Fuzzy regex (IDENTIFY)
– Exact regex (PROSITE)
Full domain alignment methods
– Profiles (PROFILE LIBRARY)
– HMMs (Pfam)
Identity matrices (PRINTS)
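To make the "exact regex" idea concrete, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a protein sequence with it. The converter covers only the basic syntax (`x`, `[...]`, `{...}`, `(n)`, `(n,m)`, `<`, `>`) and the function name is hypothetical. The pattern `N-{P}-[ST]-{P}` is PROSITE's N-glycosylation site pattern, which is not part of these notes. The sequence is the erythropoietin translation from the EMBL example entry.

```python
import re

# One pattern element, optionally followed by a repeat count: x, x(2), x(2,4)
_ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate basic PROSITE pattern syntax into a Python regex string."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            element, suffix = element[:-1], "$"
        core, low, high = _ELEMENT.fullmatch(element).groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        pieces.append(prefix + core + suffix)
    return "".join(pieces)


epo = (
    "MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE"
    "AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG"
    "QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD"
    "TFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
)

regex = prosite_to_regex("N-{P}-[ST]-{P}")
# 1-based start positions of every match in the precursor sequence
sites = [match.start() + 1 for match in re.finditer(regex, epo)]
print(regex, sites)
```

An exact pattern of this kind either matches or does not. That is the trade-off against the profile and HMM methods listed above, which score a whole domain alignment instead.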