Data Bases - Lecture 1

DATA BASES IN BIOINFORMATICS
Biological databases: why

The need for storing and communicating large datasets has grown
To make biological data available to scientists
To make biological data available in computer-readable form
Data in bioinformatics
Type of data:
nucleotide sequences
protein sequences
3D structures
gene expression data
metabolic pathways
Data entry and quality control:
data deposited directly
curators add and update data
treatment of erroneous data: removed or marked
error checking
consistency, updates
Primary, or derived data:
Primary databases: direct experimental results
Secondary databases: result of analysis on primary databases
Consolidation of many databases
DATA BASES
Organization:
Flat files
Relational databases
Object-oriented databases
Availability:
Publicly available, no restriction
Available, but with copyright
Accessible, but not downloadable
Academic, but not freely available
Commercial
Curators:
Large, public institution (EMBL, NCBI)
Quasi-academic institute (Swiss Institute of Bioinformatics, TIGR)
Academic group or scientist
Commercial company
Identifiers and Accession numbers

Identifier: a string of letters and digits that is generally human-readable.
Example: TPIS_CHICK (triose phosphate isomerase from chicken, Gallus gallus) in SWISS-PROT
The identifier can change (at the curator's discretion)
Accession code: a string of letters and digits that uniquely identifies an entry in its database.
The accession number for TPIS_CHICK in SWISS-PROT is P00940
An accession number should not change
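The difference between the two kinds of label can be sketched in a few lines of Python. The regular expressions below are assumptions for illustration: the accession pattern follows the format UniProt documents for its accession numbers, and the identifier pattern is a simplification of SWISS-PROT-style entry names (protein mnemonic, underscore, species mnemonic). The function name classify_label is hypothetical.

```python
import re

# Assumed accession format (as documented by UniProt); P00940 is one instance.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Simplified SWISS-PROT-style entry name: PROTEIN_SPECIES, e.g. TPIS_CHICK.
ENTRY_NAME_RE = re.compile(r"[A-Z0-9]{1,5}_[A-Z0-9]{1,5}")


def classify_label(label: str) -> str:
    """Return 'accession', 'identifier', or 'unknown' for a database label."""
    if ACCESSION_RE.fullmatch(label):
        return "accession"
    if ENTRY_NAME_RE.fullmatch(label):
        return "identifier"
    return "unknown"


print(classify_label("P00940"))      # stable key: use this in scripts and citations
print(classify_label("TPIS_CHICK"))  # readable name: may be changed by curators
```

Because the identifier may change, code that stores references to entries would normally keep the accession number and treat the identifier as a display label only.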
Biological databases
INFOBIOGEN Catalog of Databases
Type of database     No of records
DNA                  87
RNA                  29
Protein              94
Genomic              58
Mapping              29
Protein structure    18
Literature           43
Miscellaneous        153
Total                511
Literature databases PubMed (MedLine)

1. It contains entries for more than 11 million abstracts of scientific publications.
2. It enables users to do keyword searches, provides links to a selection of full articles, and has text mining capabilities, e.g. it provides links to related articles and GenBank entries, among others.
3. Searching PubMed efficiently requires some skill. For example, searching with the keyword interleukin returns 108,366 matches.
PubMed web-site (http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed )
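One way to narrow a broad keyword search is to add field tags to the query. The sketch below only builds query URLs for NCBI's E-utilities ESearch service and does not contact the server. The helper name pubmed_search_url and the particular narrowed query are illustrative assumptions, not part of the lecture.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a PubMed query (no request is sent)."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# A bare keyword matches very many records.
broad = pubmed_search_url("interleukin")

# Field tags restrict where the keyword may occur and what kind of
# article is wanted; this combination is only an example.
narrow = pubmed_search_url("interleukin[Title] AND review[Publication Type]")

print(broad)
print(narrow)
```

The same field-tagged query string can also be typed directly into the PubMed search box.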
Nucleic Acids databases

What information is in these databases:
DNA: sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs
Genomics: genome, gene structure and expression, genetic map, genetic disorder
RNA: sequence, secondary structure, 3D structure, interactions
Nucleotide Sequence Databases

Three main databases:
EMBL (1982): www.ebi.ac.uk/embl
GenBank (1982): www.ncbi.nlm.nih.gov/GenBank
DDBJ (1986): www.ddbj.nig.ac.jp
The three databases are synchronized on a daily basis, and the accession numbers are consistent
There are no legal restrictions on the use of these databases
However, there are some patented sequences in the databases
DNA databases
EMBL nucleotide sequence database
(http://www.ebi.ac.uk/embl/)
Public release was in June, 1982 with 568 entries.
Contains nucleotide sequences collected from all public sources.

In February 2004 it contained 30,351,263 entries (comprising 36,042,464,651 nucleotides)
EMBL is split into divisions, mostly taxonomic (e.g. prokaryotes, fungi, plants, mammals); others are ESTs (expressed sequence tags), STSs (sequence tagged sites), GSSs (genome survey sequences), HTG (high-throughput genomic data)
Accessible through the Sequence Retrieval System (SRS), which allows keyword searching (studied later)
Sequence similarity search tools: Blitz, Fasta, and BLAST (studied later)
GenBank database
(http://www.ncbi.nih.gov/Genbank/)
First public release was in December 1982, with 606 entries.
Data sources: direct submission, large sequencing projects, United States Patent and Trademark Office
Information can be retrieved from GenBank using NCBI's Entrez retrieval system.
Contains publicly available DNA sequences from more than 100,000 organisms.
Also contains derived protein sequences, and annotations describing biological, structural, and other relevant features.
Sequence similarity search tools: BLAST (studied later)
DNA databases:
GenBank Web page
A GenBank entry HEADER
GenBank entry - FEATURES
GenBank Entry Links provided in the Feature section

LocusID: locus and display of genomic and mRNA sequences
MIM: link to OMIM description, other entries for this sequence
EC_number: link to the corresponding cataloged enzymes
Protein_id: retrieve protein record from GenPept
CD: conserved protein domain (SMART)
CDD: conserved protein domain (Pfam)
GenBank - SEQUENCE
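The three parts of a GenBank entry named in the slide titles above (header, FEATURES, sequence) can be separated in a flat-file record roughly as follows. This is a minimal sketch, assuming the usual flat-file layout in which the feature table starts at a line beginning with FEATURES, the sequence starts after a line beginning with ORIGIN, and the record ends with "//". The record shown is made up for illustration and is not a real GenBank entry. The function name split_genbank_entry is hypothetical.

```python
# A made-up record in GenBank flat-file layout (not a real entry).
EXAMPLE_RECORD = """\
LOCUS       EXAMPLE1                  12 bp    DNA     linear   UNK 01-JAN-2004
DEFINITION  Made-up example record.
ACCESSION   EXAMPLE1
FEATURES             Location/Qualifiers
     source          1..12
                     /organism="synthetic construct"
ORIGIN
        1 atggccattg ta
//
"""


def split_genbank_entry(text: str):
    """Split one flat-file record into header lines, feature lines, and sequence."""
    header, features, seq_lines = [], [], []
    current = header
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            current = features          # feature table begins here
        elif line.startswith("ORIGIN"):
            current = seq_lines         # sequence lines follow this keyword
            continue
        elif line.startswith("//"):
            break                       # end-of-record marker
        current.append(line)
    # Sequence lines carry position numbers and spaces; keep letters only.
    sequence = "".join(ch for line in seq_lines for ch in line if ch.isalpha())
    return header, features, sequence


header, features, sequence = split_genbank_entry(EXAMPLE_RECORD)
print(len(header), "header lines;", len(features), "feature lines;", sequence)
```

For real work an established parser is the safer choice; the sketch is only meant to show how the three sections are delimited.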
International Nucleotide Sequence Database

Goal: streamline and standardize the process of data collection
In February 1986, EMBL and GenBank, together with DDBJ, joined forces
New and updated entries are exchanged between them daily via the internet
INSD members:
EMBL (EBI, Europe)
GenBank (NCBI, USA)
DDBJ (NIG, Japan)
Protein databases
1. SWISS-PROT (http://us.expasy.org/sprot/sprot-top.html) is a curated database focusing on a high level of annotation (sequence, function, structure, post-translational modifications, variants, etc.) of proteins.
2. TrEMBL is a computer-annotated supplement to SWISS-PROT.
Protein databases
What is in these databases:
Protein sequence, corresponding gene, secondary structure, 3D structure, function, motifs, homology, interactions
Proteomics: expression profile, proteins in disease processes, etc.
Ligands and drugs (inhibitors, activators, substrates, metabolites)
Protein sequence Data Bases

1965: Margaret Dayhoff's first published collection of sequences, the Atlas of Protein Sequence and Structure
1984: Dayhoff's atlas was released electronically, changing its name to the Protein Sequence Database (PSD) of the Protein Information Resource (PIR)
Developed by the National Biomedical Research Foundation (NBRF)
PIR-NREF contained 1,485,025 non-redundant sequences from PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept and PDB
Sequence identity, overlapping length, domain arrangement.
SWISS-PROT
SWISS-PROT provides high-quality annotations and detailed information about sequence, structural, functional, and other properties of proteins.
It provides a rich set of links to other sources of information on SWISS-PROT entries.
Unfortunately, some of the links will not work at all times, because the Web changes constantly.
It also provides a rich set of protein analysis tools.
web-page
SWISS-PROT entry P00709
Protein Family DataBases

Pattern or secondary databases
Derived from the analysis of the sequences in the primary sources
Because they use several different sources and a variety of ways of analyzing sequences, the information housed in each of the family databases is different

Family Database   Data source        Stored information
PROSITE           Swiss-Prot         Regular expressions (patterns)
Profiles          Swiss-Prot         Weighted matrices (profiles)
PRINTS            Swiss-Prot         Aligned motifs (fingerprinting)
Pfam              Swiss-Prot         Hidden Markov models
Blocks            InterPro/PRINTS
eMOTIF            Blocks/PRINTS      Permissive regular expressions (patterns)
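To show what a stored regular expression (pattern) looks like in practice, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It covers only the basic pattern syntax (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, < and > for the sequence ends) and is a simplification. The function name prosite_to_regex is hypothetical and the test sequence is made up. The pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")   # anchored at the N-terminus
        at_end = element.endswith(">")       # anchored at the C-terminus
        element = element.strip("<>")
        # Separate the residue part from an optional repeat such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        piece = core + ("{" + repeat + "}" if repeat else "")
        parts.append(("^" if at_start else "") + piece + ("$" if at_end else ""))
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
made_up_sequence = "MKNGSAANPTQ"  # invented for illustration
hits = re.findall(regex, made_up_sequence)
print(regex, hits)
```

In the made-up sequence, the stretch NGSA matches, while NPTQ does not because proline follows the asparagine.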
Protein Structure Data Bases

These are limited to the relatively few 3D structures generated by crystallographic and spectroscopic studies
These resources can be divided into those that house the actual 3D coordinates of solved structures and those that classify and summarize the structures of those proteins
These include the PDB (Protein Data Bank), SCOP (Structural Classification of Proteins), and CATH (Class, Architecture, Topology, Homology).
THANK YOU
