
DATA BASES IN BIOINFORMATICS

Biological databases: why


- The need to store and communicate large datasets has grown
- To make biological data available to scientists
- To make biological data available in computer-readable form

Data in bioinformatics
Type of data:
- nucleotide sequences
- protein sequences
- 3D structures
- gene expression data
- metabolic pathways

Data entry and quality control:
- data deposited directly
- curators add and update data
- treatment of erroneous data: removed or marked
- error checking
- consistency, updates

Primary, or derived data:
- Primary databases: direct experimental results
- Secondary databases: result of analysis on primary databases
- Consolidation of many databases

DATA BASES
Organization:
- Flat files
- Relational databases
- Object-oriented databases

Availability:
- Publicly available, no restriction
- Available, but with copyright
- Accessible, but not downloadable
- Academic, but not freely available
- Commercial

Curators:
- Large, public institution (EMBL, NCBI)
- Quasi-academic institute (Swiss Institute of Bioinformatics, TIGR)
- Academic group or scientist
- Commercial company
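To make the difference between a flat file and a relational database concrete, here is a minimal sketch in Python. The pipe-delimited line layout, the table name `entry` and its column names are assumptions made for this illustration only; no real database uses this exact layout. The example entry (TPIS_CHICK, P00940) is the one used in these slides.

```python
import sqlite3

# Hypothetical flat-file line: one entry per line, fields separated by "|".
FLAT_FILE_LINE = "P00940|TPIS_CHICK|Triose phosphate isomerase|Gallus gallus"


def load_flat_file(lines):
    """Load pipe-delimited flat-file lines into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry ("
        "accession TEXT PRIMARY KEY, identifier TEXT, "
        "description TEXT, organism TEXT)"
    )
    for line in lines:
        # Each flat-file line becomes one row; the accession is the key.
        conn.execute("INSERT INTO entry VALUES (?, ?, ?, ?)", line.split("|"))
    conn.commit()
    return conn


conn = load_flat_file([FLAT_FILE_LINE])
# A relational lookup by key replaces a linear scan through the flat file.
row = conn.execute(
    "SELECT identifier FROM entry WHERE accession = ?", ("P00940",)
).fetchone()
print(row[0])
```

The flat file is easy to exchange and read, while the relational form supports keyed lookups and queries across fields.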

Identifiers and Accession numbers


Identifier: a string of letters and digits that is generally understandable to a human reader.
- Example: TPIS_CHICK (triose phosphate isomerase from chicken, Gallus gallus) in Swiss-Prot
- The identifier can change (at the curator's discretion)

Accession code: a string of letters and digits that uniquely identifies an entry in its database.
- The accession number for TPIS_CHICK in Swiss-Prot is P00940
- An accession number should not change
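The distinction can be sketched in a few lines of Python. This is an illustration only: reading the identifier as "protein mnemonic, underscore, species code" is an assumption based on the Swiss-Prot naming convention, and the function name `parse_entry_name` and the dictionary `index` are hypothetical.

```python
def parse_entry_name(identifier):
    """Split an identifier such as TPIS_CHICK into (protein, species) parts."""
    protein, sep, species = identifier.partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"not of the form PROTEIN_SPECIES: {identifier!r}")
    return protein, species


# Index entries by accession number, which is meant to stay stable,
# rather than by identifier, which a curator may change.
index = {"P00940": "TPIS_CHICK"}

protein, species = parse_entry_name(index["P00940"])
print(protein, species)
```

Keying on the accession number means that a later change of identifier only touches the stored value, not the key that other records point to.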

Biological databases
INFOBIOGEN Catalog of Databases

Type of database     No. of records
DNA                  87
RNA                  29
Protein              94
Genomic              58
Mapping              29
Protein structure    18
Literature           43
Miscellaneous        153
Total                511

Literature databases PubMed (MedLine)


1. It contains entries for more than 11 million abstracts of scientific publications.
2. It enables users to do keyword searches, provides links to a selection of full articles, and has text-mining capabilities, e.g. it provides links to related articles and GenBank entries, among others.
3. Searching PubMed efficiently requires some skill. For example, searching with the keyword interleukin returns 108,366 matches.

PubMed web-site (http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed)
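One way to narrow a broad keyword search is to restrict the term to a field. The sketch below only builds a query URL and does not contact any server. It assumes NCBI's E-utilities `esearch` endpoint and the `[Title]` field tag, neither of which is mentioned in these slides; the function name `build_pubmed_query` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities esearch (not part of the original slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(keyword, field=None, retmax=20):
    """Build a PubMed search URL, optionally restricting the keyword to a field."""
    term = f"{keyword}[{field}]" if field else keyword
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Broad search: matches the keyword anywhere in the record.
broad_url = build_pubmed_query("interleukin")
# Narrower search: matches the keyword in article titles only.
title_url = build_pubmed_query("interleukin", field="Title")
print(broad_url)
print(title_url)
```

Restricting the field is one of several ways to cut down a result set; combining keywords is another.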

Nucleic Acids databases


What information is in these databases:
- DNA sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs
- Genomics: genome, gene structure and expression, genetic map, genetic disorder
- RNA sequence, secondary structure, 3D structure, interactions

Nucleotide Sequence Databases


3 main databases:
- EMBL (1982): www.ebi.ac.uk/embl
- GenBank (1982): www.ncbi.nlm.nih.gov/GenBank
- DDBJ (1986): www.ddbj.nig.ac.jp

- The 3 databases are synchronized on a daily basis, and the accession numbers are consistent
- There are no legal restrictions on the use of these databases
- However, there are some patented sequences in the databases

DNA databases
EMBL nucleotide sequence database
(http://www.ebi.ac.uk/embl/)
Public release was in June, 1982 with 568 entries.

Contains nucleotide sequences collected from all public sources.


In February 2004 it contained 30,351,263 entries (comprising 36,042,464,651 nucleotides).

- EMBL is split into divisions, mostly taxonomic (e.g. prokaryotes, fungi, plants, mammals); others are ESTs (expressed sequence tags), STSs (sequence tagged sites), GSSs (genome survey sequences), and HTG (high-throughput genomic data)
- Accessible through the Sequence Retrieval System (SRS), which allows keyword searching (studied later)
- Sequence similarity search tools: Blitz, Fasta, and BLAST (studied later)

GenBank database
(http://www.ncbi.nih.gov/Genbank/)
- First public release was in December 1982, with 606 entries.
- Sources: direct submissions, large sequencing projects, and the United States Patent and Trademark Office.
- Information can be retrieved from GenBank using NCBI's Entrez retrieval system.
- Contains publicly available DNA sequences from more than 100,000 organisms.
- Also contains derived protein sequences, and annotations describing biological, structural, and other relevant features.
- Sequence similarity search tool: BLAST (studied later)

DNA databases:

GenBank Web page

A GenBank entry HEADER

GenBank entry - FEATURES

GenBank Entry Links provided in the Feature section


- LocusID: locus and display of genomic and mRNA sequences
- MIM: link to OMIM description, other entries for this sequence
- EC_number: link to the corresponding cataloged enzymes
- Protein_id: retrieve protein record from GenPept
- CDD: conserved protein domain (SMART)
- CDD: conserved protein domain (Pfam)

GenBank - SEQUENCE
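The three parts of an entry named on these slides (header, features, sequence) can be separated with a short Python sketch. It relies on the flat-file keywords `FEATURES`, `ORIGIN` and the `//` terminator. The record below is a made-up demonstration record, not a real GenBank entry, and the function name `split_genbank_record` is hypothetical.

```python
# Made-up record for illustration only; it is not a real GenBank entry.
SAMPLE_RECORD = """\
LOCUS       DEMO0001                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up demonstration record.
ACCESSION   DEMO0001
FEATURES             Location/Qualifiers
     source          1..24
                     /organism="synthetic construct"
ORIGIN
        1 atggccaagt tgaccgatcg ctaa
//
"""


def split_genbank_record(text):
    """Split one flat-file record into (header, features, sequence)."""
    header, features, seq_parts = [], [], []
    section = "header"
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("FEATURES"):    # feature table starts here
            section = "features"
        elif line.startswith("ORIGIN"):    # sequence starts on the next line
            section = "sequence"
            continue
        if section == "header":
            header.append(line)
        elif section == "features":
            features.append(line)
        else:
            # Drop position numbers and spaces, keep only the bases.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
    return "\n".join(header), "\n".join(features), "".join(seq_parts)


header, features, sequence = split_genbank_record(SAMPLE_RECORD)
print(len(sequence), sequence)
```

For real work a maintained parser is a safer choice than hand-written splitting; the sketch is only meant to show how the three sections are delimited.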

International Nucleotide Sequence Database


- Aim: streamline and standardize the process of data collection
- In February 1986, EMBL and GenBank joined forces with DDBJ
- New and updated entries are exchanged between them daily via the Internet

INSD members: EMBL (EBI, Europe), GenBank (NCBI, USA), DDBJ (NIG, Japan)

Protein databases
1. SWISS-PROT (http://us.expasy.org/sprot/sprot-top.html) is a curated database focusing on a high level of annotation (sequence, function, structure, post-translational modifications, variants, etc.) of proteins.

2. TrEMBL is a computer-annotated supplement to SWISS-PROT.

Protein databases
What is in these databases:
- Protein sequence, corresponding gene, secondary structure, 3D structure, function, motifs, homology, interactions
- Proteomics: expression profile, proteins in disease processes, etc.
- Ligands and drugs (inhibitors, activators, substrates, metabolites)

Protein sequence Data Bases


- 1965: Margaret Dayhoff's first published collection of sequences, the Atlas of Protein Sequence and Structure
- 1984: Dayhoff's atlas was released electronically, changing its name to the Protein Sequence Database (PSD) of the Protein Identification Resource (PIR)
- Developed by the National Biomedical Research Foundation (NBRF)
- PIR-NREF contained 1,485,025 non-redundant sequences from PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept and PDB
- Sequence identity, overlapping length, domain arrangement.

SWISS-PROT
- SWISS-PROT provides high-quality annotations and detailed information about sequence, structural, functional, and other properties of proteins.
- It provides a rich set of links to other sources of information on SWISS-PROT entries. Unfortunately, some of the links will not work at all times, because the Web changes constantly.
- It also provides a rich set of protein analysis tools.

web-page

SWISS-PROT entry P00709
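A SWISS-PROT flat-file entry is organized as lines that each start with a two-letter line code (for example `ID`, `AC`, `DE`, `OS`) and ends with `//`. The Python sketch below groups lines by code. The fragment is hand-written and heavily abbreviated from the TPIS_CHICK example given earlier in these slides; it is not a copy of the real entry, and the function name `parse_line_codes` is hypothetical.

```python
from collections import defaultdict

# Hand-written, abbreviated fragment for illustration; not the real entry.
SAMPLE_ENTRY = """\
ID   TPIS_CHICK
AC   P00940;
DE   Triose phosphate isomerase.
OS   Gallus gallus (Chicken).
//
"""


def parse_line_codes(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):      # end-of-entry marker
            break
        code, value = line[:2], line[5:].strip()
        if code.strip():               # skip blank lines
            fields[code].append(value)
    return dict(fields)


fields = parse_line_codes(SAMPLE_ENTRY)
print(fields["ID"], fields["AC"])
```

Grouping by line code keeps the identifier (`ID`) and the accession number (`AC`) separate, which mirrors the distinction made on the identifiers slide.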

Protein Family DataBases


- Also called pattern or secondary databases
- Derived from the analysis of the sequences in the primary sources
- Because they draw on several different sources and analyze sequences in a variety of ways, the information housed in each family database is different

Family database   Data source        Stored information
PROSITE           Swiss-Prot         Regular expressions (patterns)
Profiles          Swiss-Prot         Weighted matrices (profiles)
PRINTS            Swiss-Prot         Aligned motifs (fingerprints)
Pfam              Swiss-Prot         Hidden Markov models
Blocks            InterPro/PRINTS
eMOTIF            Blocks/PRINTS      Permissive regular expressions (patterns)
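To show what "regular expressions (patterns)" means in practice, the Python sketch below translates a pattern written in PROSITE-style syntax into a Python regular expression. The pattern used is an illustrative one, not tied to any particular database entry, and the test sequences are made up. The function name `prosite_to_regex` is hypothetical and the translation covers only the basic syntax elements (residues, `x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` anchors).

```python
import re

# One pattern element: optional "<", a core, optional repeat count, optional ">".
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":                  # x: any residue
            regex = "."
        elif core.startswith("{"):       # {ED}: any residue except E or D
            regex = "[^" + core[1:-1] + "]"
        else:                            # V or [AC]: same meaning in a regex
            regex = core
        if low is not None:              # x(4) or x(2,4): repeat counts
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + regex + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
hit = re.search(regex, "MACVKLMNQ")      # made-up sequence that fits
miss = re.search(regex, "MACVKLMNE")     # last residue E is excluded
print(regex, hit.start() if hit else None, miss)
```

A pattern of this kind either matches or does not; the profile and hidden Markov model approaches in the table score a sequence instead, which makes them more tolerant of variation.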

Protein Structure Data Bases


- These are limited to the relatively few 3D structures generated by crystallographic and spectroscopic studies.
- These resources can be divided into those that house the actual 3D coordinates of solved structures and those that classify and summarize the structures of those proteins.
- They include the PDB (Protein Data Bank), SCOP (Structural Classification of Proteins), and CATH (Class, Architecture, Topology, Homology).

THANK YOU
