Data in bioinformatics
Type of data:
nucleotide sequences
protein sequences
3D structures
gene expression data
metabolic pathways

data deposited directly
curators add and update data
treatment of erroneous data: removed or marked
error checking, consistency, updates

Primary databases: direct experimental results
Secondary databases: result of analysis on primary databases
Consolidation of many databases
DATABASES
Organization:
flat files
relational databases
object-oriented databases

Availability:
publicly available, no restriction
available, but with copyright
accessible, but not downloadable
academic, but not freely available
commercial

Curators:
large public institution (EMBL, NCBI)
quasi-academic institute (Swiss Institute of Bioinformatics, TIGR)
academic group or scientist
commercial company
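To make the flat-file organization more concrete: EMBL and SWISS-PROT style flat files are plain text, where each line starts with a two-letter line code (ID, AC, DE, ...) and each entry ends with a `//` line. The sketch below is only a minimal illustration of that layout. The function name `parse_flat_file` and the two sample entries are invented for this example and are not real database records.

```python
def parse_flat_file(text):
    """Split flat-file text into entries.

    Each entry is a dict mapping a two-letter line code to the list of
    values found on lines with that code, in order of appearance.
    """
    entries = []
    current = {}
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-entry marker
            if current:
                entries.append(current)
            current = {}
            continue
        if len(line.strip()) < 2:        # skip blank or malformed lines
            continue
        code, value = line[:2], line[2:].strip()
        current.setdefault(code, []).append(value)
    if current:                          # entry without a trailing "//"
        entries.append(current)
    return entries


# Invented sample data (hypothetical identifiers and descriptions).
SAMPLE = """\
ID   EXAMPLE_1
AC   X00001;
DE   Hypothetical example protein.
//
ID   EXAMPLE_2
AC   X00002;
DE   Another invented entry,
DE   with a two-line description.
//
"""

entries = parse_flat_file(SAMPLE)
for entry in entries:
    print(entry["ID"][0], entry["AC"][0], " ".join(entry["DE"]))
```

A relational or object-oriented database would store the same fields in tables or objects instead of line-coded text; the flat file trades query power for simplicity and portability.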
Accession code: a string of letters and digits that uniquely identifies an entry in its database.
For example, the accession number for TPIS_CHICK in SWISS-PROT is P00940.
An accession number should not be changed.
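As a small illustration of how an accession code works as a stable key, the sketch below checks the format of an accession and looks it up in a tiny in-memory index. The regular expression is the accession pattern published in the UniProt help pages (UniProt now hosts SWISS-PROT); other databases use different formats. The names `is_valid_accession`, `entry_name`, and `INDEX` are hypothetical and exist only for this example.

```python
import re

# Accession pattern as documented by UniProt (assumption: it covers the
# SWISS-PROT accessions discussed here; other databases differ).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_accession(code):
    """Return True if the whole string looks like a UniProt accession."""
    return UNIPROT_ACCESSION.fullmatch(code) is not None


# Hypothetical miniature index: accession -> entry name.
INDEX = {"P00940": "TPIS_CHICK"}


def entry_name(accession):
    """Look up an entry name by accession; None if it is not in the index."""
    if not is_valid_accession(accession):
        raise ValueError(f"not a valid accession: {accession!r}")
    return INDEX.get(accession)


print(is_valid_accession("P00940"))      # accession code
print(is_valid_accession("TPIS_CHICK"))  # entry name, not an accession
print(entry_name("P00940"))
```

Because the accession is the lookup key, changing it would break every external reference to the entry, which is why it is expected to stay fixed.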
Biological databases
INFOBIOGEN Catalog of Databases

Type of database     No. of records
DNA                   87
RNA                   29
Protein               94
Genomic               58
Mapping               29
Protein structure     18
Literature            43
Miscellaneous        153
Total                511
DNA databases
EMBL nucleotide sequence database
(http://www.ebi.ac.uk/embl/)
First public release was in June 1982, with 568 entries.
EMBL is split into divisions, mostly taxonomic (e.g. prokaryotes, fungi, plants, mammals); others include ESTs (expressed sequence tags), STSs (sequence tagged sites), GSSs (genome survey sequences), and HTG (high-throughput genomic data).
Accessible through the Sequence Retrieval System (SRS), which allows keyword searching (studied later).
Sequence similarity search tools: Blitz, Fasta, and BLAST (studied later).
GenBank database
(http://www.ncbi.nih.gov/Genbank/)
First public release was in December 1982, with 606 entries.
Data sources: direct submissions, large sequencing projects, and the United States Patent and Trademark Office.
Information can be retrieved from GenBank using NCBI's Entrez retrieval system.
Contains publicly available DNA sequences from more than 100,000 organisms.
Also contains derived protein sequences, and annotations describing biological, structural, and other relevant features.
Sequence similarity search tools: BLAST (studied later).
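Entrez can also be queried programmatically through NCBI's E-utilities. As a hedged sketch of what retrieval by accession might look like, the code below only builds an EFetch request URL for a GenBank flat-file record and does not contact the network. The helper name `genbank_fetch_url` and the placeholder accession are assumptions made for this example.

```python
from urllib.parse import urlencode

# Base address of the EFetch utility in NCBI's E-utilities.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fetch_url(accession):
    """Build an EFetch URL asking for one nucleotide record as GenBank text."""
    params = {
        "db": "nuccore",     # Entrez nucleotide database
        "id": accession,     # accession number of the record
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",   # plain text rather than XML
    }
    return EUTILS_EFETCH + "?" + urlencode(params)


# "EXAMPLE_ACC" is a placeholder, not a real accession number.
url = genbank_fetch_url("EXAMPLE_ACC")
print(url)
```

Passing the resulting URL to `urllib.request.urlopen` with a real accession would download the record; that step is left out here so the example runs offline.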
DNA databases:
EC_number: link to the corresponding cataloged enzymes
Protein_id: retrieve protein record from GenPept
CD: conserved protein domain (SMART)
CDD: conserved protein domain (Pfam)
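In a GenBank flat file, fields such as EC_number and protein_id appear as qualifiers in the feature table, written as `/key="value"` lines under a feature. The sketch below pulls those qualifiers out of a feature fragment so they can be used as links to other records. The fragment is invented for illustration (the gene name and protein_id are placeholders), and `parse_qualifiers` is a hypothetical helper, not part of any NCBI tool.

```python
import re

# Matches qualifiers of the form /key="value" on a single line.
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')

# Invented feature fragment; identifiers are placeholders.
SAMPLE_FEATURE = '''\
     CDS             1..750
                     /gene="exampleGene"
                     /EC_number="5.3.1.1"
                     /protein_id="XXX00000.1"
'''


def parse_qualifiers(feature_text):
    """Return a dict mapping each qualifier name to its list of values."""
    qualifiers = {}
    for key, value in QUALIFIER.findall(feature_text):
        qualifiers.setdefault(key, []).append(value)
    return qualifiers


quals = parse_qualifiers(SAMPLE_FEATURE)
print("EC number(s):", quals["EC_number"])
print("Protein ID(s):", quals["protein_id"])
```

This simple pattern ignores qualifiers whose values span several lines or are unquoted; a production parser would need to handle both.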
GenBank - SEQUENCE
INSD
Protein databases
1. SWISS-PROT (http://us.expasy.org/sprot/sprot-top.html) is a curated database focusing on a high level of annotation (sequence, function, structure, post-translational modifications, variants, etc.) of proteins.
2. TrEMBL is a computer-annotated supplement to SWISS-PROT.
Protein databases
What is in these databases:
Protein sequence, corresponding gene, secondary structure, 3D structure, function, motifs, homology, interactions
Proteomics: expression profile, proteins in disease processes, etc.
Ligands and drugs (inhibitors, activators, substrates, metabolites)
SWISS-PROT
SWISS-PROT provides high-quality annotations and detailed information about the sequence, structural, functional, and other properties of proteins. It provides a rich set of links to other sources of information on SWISS-PROT entries, although some of these links may not work at all times because web resources change. It also provides a rich set of protein analysis tools.
web-page
THANK YOU