
COMPUTATIONAL BIOLOGY

B.Tech – BioTech (VIth Semester)

Module 1
WHAT is a database?
A collection of data that needs to be:
• Structured
• Searchable
• Updated (periodically)
• Cross referenced

Challenge:
To change “meaningless” data into useful information that can be
accessed and analysed the best way possible.

For example:
HOW would YOU organise all biological sequences so that the
biological information is optimally accessible?

You need an appropriate database management system (DBMS)


Why Databases?

• A means to handle and share large volumes of biological data

• Support large-scale analysis software

• Make data access easy and keep the data up to date

• Link knowledge obtained from various fields of biology and medicine
Biological databases
• Like any other database
– Data organized for optimal analysis
– A large library of life science information, usually associated
with software to access, interpret, and analyse the data

• Data is of different types


– Raw data (DNA, RNA, protein sequences)
– Curated data (DNA, RNA and protein annotated sequences
and structures, expression data)
Raw Biological data

Nucleic Acids (DNA)


Raw Biological data

Amino acid residues (proteins)


Curated Biological Data
DNA, nucleotide sequences

Gene boundaries, topology
Gene structure

Introns, exons, ORFs, splicing

Expression data
Mass spectrometry


Curated Biological Data
Proteins, residue sequences
Mass spectrometry
Extended sequence information (metabolomics, proteomics)
MCTUYTCUYFSTYRCCTYFSCD

Secondary structure

Post-Translational protein
Modification (PTM)

Hydrophobicity, motif data

Protein-protein interaction
Curated Biological data

3D Structures Folds
Data Domains
• Types of data generated by molecular biology
research:
– Nucleotide sequences (DNA and mRNA)
– Protein sequences
– 3-D protein structures
– Complete genomes and maps

• Also now have:


– Gene expression
– Genetic variation (polymorphisms)
Types of Databases - By Scope
• Comprehensive
– Contain data from many organisms and many different types of
sequences. Examples:
 Nucleotide
• GenBank
• EMBL: European Molecular Biology Laboratory
• DDBJ: DNA Data Bank of Japan

The three databases above comprise the International Nucleotide Sequence Database Collaboration (INSDC) and currently include sequence
• Comprehensive

 Protein, such as Swiss-Prot


 Protein Structure, such as PDB: Protein Data Bank
 Genomes and Maps, such as Entrez Genomes

• Specialized
 Contain data from individual organisms
 Specific categories/functions of sequences
 Data generated by specific sequencing technologies
Types of Databases - By Level of Curation

• Archival data
– Repository of information
– Redundant
– Submitters maintain editorial control over their records
– No controlled vocabulary
– Variation in annotation of biological features

• Curated data
– Non-redundant
– Each record is intended to present an encapsulation of the current
understanding of a gene or protein
– Records contain value-added information that has been added by
experts

Different Database Types

• Depends on the nature of the information stored (sequences, 2D gel or 3D structure images)

• Manner of storage (flat files, tables in a relational database, etc.)
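
The relational option can be sketched with Python's built-in sqlite3 module. The table and column names below are assumptions made for illustration and do not reflect the schema of any real biological database. The two records reuse accessions and lengths that appear in the example entries later in these notes.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence_record ("
    " accession TEXT PRIMARY KEY,"   # structured: one row per record
    " organism  TEXT,"
    " molecule  TEXT,"
    " length    INTEGER)"
)
conn.execute(
    "CREATE TABLE cross_reference ("
    " accession   TEXT REFERENCES sequence_record(accession),"
    " external_db TEXT,"
    " external_id TEXT)"
)

conn.executemany(
    "INSERT INTO sequence_record VALUES (?, ?, ?, ?)",
    [("X02158", "Homo sapiens", "DNA", 3398),
     ("P01375", "Homo sapiens", "protein", 233)],
)
conn.execute(
    "INSERT INTO cross_reference VALUES ('X02158', 'SWISS-PROT', 'P01588')"
)

# Searchable: human records longer than 1000 units.
long_human = conn.execute(
    "SELECT accession FROM sequence_record "
    "WHERE organism = ? AND length > ? ORDER BY accession",
    ("Homo sapiens", 1000),
).fetchall()

# Cross-referenced: follow the link from one record to another database.
links = conn.execute(
    "SELECT external_db, external_id FROM cross_reference WHERE accession = ?",
    ("X02158",),
).fetchall()
print(long_human, links)
```

A flat file would hold the same facts as lines of text, so every search means scanning and parsing the file. The relational layout lets the DBMS index and join the records instead.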


Types of Biological Databases Accessible

There are many different types of database, but for routine sequence analysis the following are initially the most important:

 Primary databases

 Secondary databases

 Composite databases
Primary Databases
Contain sequence data such as nucleic acid or protein

Examples of primary databases include:


 Protein Databases
• SWISS-PROT
• Tr-EMBL
• PIR

 Nucleic Acid Databases


• EMBL
• GenBank
• DDBJ
SWISS-PROT
 Swiss-Prot is not a repository database
 Founder: Amos Bairoch
 Established and maintained collaboratively since 1987, by the Department
of Medical Biochemistry of the University of Geneva and EBI
 Best annotated database
 Complete, Curated, Non-redundant and cross-referenced with 34 other
databases
 Highly cross-referenced
 Available from a variety of servers and through sequence analysis software
tools
 More than 8,000 different species
 First 20 species represent about 42% of all sequences in the database
 More than 1,29,000 entries with 4.7 × 10^7 amino acids
 More than 6,22,000 entries in TrEMBL
Swiss-Prot File Format
 Structure of Swiss-Prot entries:
• General Information
• Bibliographic Information
• Functional Information
• Sequence

 Entry begins with (ID) i.e. identification line

 Entry terminates with (//) terminator symbol


• General Information
– Entry name: (ID line) Format: PROTEIN_SOURCE
– Primary Accession Number: (AC line)
– Secondary Accession Number
– Date Line: (DT line)
– Protein name: (DE line)
– Synonyms
– Gene Name: (GN line)
– Organism: (OS line)
– Organism Classification: (OC line)

• Bibliographic Information (RN, RP, RL, ...)
– List of bibliographic references that were used to build the current entry
• Comments (CC)
– Function
– Catalytic activity
– Subcellular Location
– Alternative Products
– Tissue Specificity
– Miscellaneous
– Similarity

• Cross-reference: (DR line)

• Keywords: (KW line)


• Features: (FT line/s)
– Features of the protein are organized in a table:
KEY | FROM | TO | LENGTH | DESCRIPTION

– KEY section further contains:


• SIGNAL
• CHAIN
• DOMAIN
• TRANSMEM
• REPEAT

• Final Section/ Sequence Section: (SQ line)


– Single letter code
– 60 residues per line
Swiss-Prot File Example:

ID   TNFA_HUMAN     STANDARD;      PRT;   233 AA.
AC   P01375;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE   TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
GN   TNFA.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 87217060.
RA   NEDOSPASOV S.A., SHAKHOV A.N., TURETSKAYA R.L., METT V.A.,
RA   AZIZOV M.M., GEORGIEV G.P., KOROBKO V.G., DOBRYNIN V.N.,
RA   FILIPPOV S.A., BYSTROV N.S., BOLDYREVA E.F., CHUVPILO S.A.,
RA   CHUMAKOV A.M., SHINGAROVA L.N., OVCHINNIKOV Y.A.;
RL   COLD SPRING HARB. SYMP. QUANT. BIOL. 51:611-624(1986).
RN   [2]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85086244.
RA   PENNICA D., NEDWIN G.E., HAYFLICK J.S., SEEBURG P.H., DERYNCK R.,
RA   PALLADINO M.A., KOHR W.J., AGGARWAL B.B., GOEDDEL D.V.;
RL   NATURE 312:724-729(1984).
RN   [3]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85137898.
RA   SHIRAI T., YAMAGUCHI H., ITO H., TODD C.W., WALLACE R.B.;
RL   NATURE 313:803-806(1985).
RN   [4]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86016093.
RA   NEDWIN G.E., NAYLOR S.L., SAKAGUCHI A.Y., SMITH D.H.,
RA   JARRETT-NEDWIN J., PENNICA D., GOEDDEL D.V., GRAY P.W.;
RL   NUCLEIC ACIDS RES. 13:6361-6373(1985).
CC   -!- FUNCTION: CYTOKINE WITH A WIDE VARIETY OF FUNCTIONS: IT CAN
CC       CAUSE CYTOLYSIS OF CERTAIN TUMOR CELL LINES, IT IS IMPLICATED
CC       IN THE INDUCTION OF CACHEXIA, IT IS A POTENT PYROGEN CAUSING
CC       FEVER BY DIRECT ACTION OR BY STIMULATION OF IL-1 SECRETION, IT
CC       CAN STIMULATE CELL PROLIFERATION & INDUCE CELL DIFFERENTIATION
CC       UNDER CERTAIN CONDITIONS.
CC   -!- SUBUNIT: HOMOTRIMER.
CC   -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. ALSO EXISTS AS
DR   EMBL; X02910; HSTNFA.
DR   EMBL; M16441; HSTNFAB.
DR   EMBL; X01394; HSTNFR.
DR   EMBL; M10988; HSTNFAA.
DR   PDB; 2TUN; 31-JAN-94.
DR   MIM; 191160; 11TH EDITION.
DR   PROSITE; PS00251; TNF.
KW   CYTOKINE; CYTOTOXIN; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL-ANCHOR;
KW   MYRISTYLATION; 3D-STRUCTURE.
FT   PROPEP        1     76
FT   CHAIN        77    233       TUMOR NECROSIS FACTOR.
FT   TRANSMEM     36     56       SIGNAL-ANCHOR (TYPE-II PROTEIN).
FT   LIPID        19     19       MYRISTATE.
FT   LIPID        20     20       MYRISTATE.
FT   DISULFID    145    177
FT   MUTAGEN     108    108       R->W: BIOLOGICALLY INACTIVE.
FT   MUTAGEN     112    112       L->F: BIOLOGICALLY INACTIVE.
FT   STRAND       89     93
FT   TURN         99    100
FT   TURN        109    110
FT   STRAND      112    113
FT   TURN        115    116
FT   STRAND      118    119
FT   STRAND      124    125
FT   STRAND      218    218
FT   STRAND      227    232
SQ   SEQUENCE   233 AA;  25644 MW;  279986 CN;
     MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI VAGATTLFCL LHFGVIGPQR
     EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG QLQWLNRRAN ALLANGVELR
     DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA VSYQTKVNLL SAIKSPCQRE
     TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF AESGQVYFGI IAL
//
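
Because every line of a Swiss-Prot entry starts with a two-letter code, a reader for the format can be very small. The following is a minimal sketch, not an official parser. It assumes the old fixed layout of the TNFA_HUMAN example (code in columns 1-2, data from column 6, sequence lines with a blank code), and the function name `parse_entry` is made up for illustration.

```python
from collections import defaultdict

# Shortened copy of the TNFA_HUMAN entry (selected lines only).
ENTRY = """\
ID   TNFA_HUMAN     STANDARD;      PRT;   233 AA.
AC   P01375;
DE   TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
GN   TNFA.
OS   HOMO SAPIENS (HUMAN).
DR   EMBL; X02910; HSTNFA.
DR   PDB; 2TUN; 31-JAN-94.
FT   CHAIN        77    233       TUMOR NECROSIS FACTOR.
FT   TRANSMEM     36     56       SIGNAL-ANCHOR (TYPE-II PROTEIN).
SQ   SEQUENCE   233 AA;  25644 MW;  279986 CN;
     MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI VAGATTLFCL LHFGVIGPQR
     EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG QLQWLNRRAN ALLANGVELR
     DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA VSYQTKVNLL SAIKSPCQRE
     TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF AESGQVYFGI IAL
//
"""


def parse_entry(text):
    """Group lines by two-letter code; collect the sequence separately."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        if line.startswith("//"):        # terminator: end of entry
            break
        code, body = line[:2], line[5:]
        if code.strip() == "":           # sequence lines have a blank code
            sequence.append(body.replace(" ", ""))
        else:
            fields[code].append(body)
    return dict(fields), "".join(sequence)


fields, seq = parse_entry(ENTRY)
entry_name = fields["ID"][0].split()[0]
accession = fields["AC"][0].rstrip(";")
features = [ft.split()[:3] for ft in fields["FT"]]   # key, from, to
print(entry_name, accession, len(seq), features)
```

The length of the joined sequence can be checked against the residue count given in the ID and SQ lines, which is a quick sanity test for a parsed entry.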
Tr-EMBL

 Created in 1996
 Computer-annotated supplement to SWISS-PROT
 Translations of coding sequences (CDS) of EMBL database
maintained at the EBI, UK
 TrEMBL contains everything that is not yet in SWISS-PROT
 Two sections:
– SP- TrEMBL
– REM- TrEMBL
PIR
 PIR stands for Protein Information Resource

 Established in 1984 by the National Biomedical Research Foundation (NBRF); described by Barker et al., 1998

 An association of macromolecular sequence data collection centres:


• Protein Information Resource
• Japan International Protein Information Database (JIPID)
• Martinsried Institute for Protein Sequences (MIPS)

 Splits into four distinct sections:


• PIR1: fully classified and annotated entries
• PIR2: preliminary entries, not reviewed, may contain redundancy
• PIR3: unverified entries
• PIR4: entries that fall into one of four further categories
GENBANK

 GenBank, maintained by NCBI at the NIH, is an annotated collection of all
publicly available nucleotide sequences (DNA/RNA).
 Sequence records are derived from the sequencing of a
biological molecule that exists in a test tube, somewhere in a
lab
 All records are generated from direct submissions by original
authors
 Full release occurs on a bimonthly schedule
 Updates are available daily via FTP
 Exchanges sequences with DDBJ and EMBL
 The three databases share a common feature table format
GENBANK FLATFILE (GBFF)
 The GBFF is the elementary unit of information in GenBank
 One of the most commonly used formats for sequence data
 GenBank files are grouped into divisions:
– Phylogenetically based
– Technical information
 GBFF has three parts
– Header
– Features
– Nucleotide Sequence
 All records end with (//) on last line
 The Header
– Database-specific part of the record
– First line is the LOCUS line; the locus name (SCU49845 below) is unique

LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999

The GenBank database is divided into 18 divisions:
1. PRI - primate sequences
2. ROD - rodent sequences
3. MAM - other mammalian sequences
4. VRT - other vertebrate sequences
5. INV - invertebrate sequences
6. PLN - plant, fungal, and algal sequences
7. BCT - bacterial sequences
8. VRL - viral sequences
9. PHG - bacteriophage sequences
10. SYN - synthetic sequences
11. UNA - unannotated sequences
12. EST - EST sequences (expressed sequence tags)
13. PAT - patent sequences
14. STS - STS sequences (sequence tagged sites)
15. GSS - GSS sequences (genome survey sequences)
16. HTG - HTG sequences (high-throughput genomic sequences)
17. HTC - unfinished high-throughput cDNA sequencing
18. ENV - environmental sampling sequences
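
The LOCUS line and the division codes can be read together, as in the rough sketch below. It assumes the simple whitespace-separated layout of the SCU49845 example; real records may carry extra fields (for example a topology word), which this sketch rejects rather than guesses at. The helper name `parse_locus` is hypothetical.

```python
# Division codes copied from the list above (subset shown for brevity).
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
}


def parse_locus(line):
    """Split a LOCUS line into named fields by whitespace.

    Assumed layout: LOCUS name length 'bp' molecule division date
    """
    parts = line.split()
    if len(parts) != 7 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError("unexpected LOCUS layout: " + line)
    return {
        "name": parts[1],
        "length": int(parts[2]),
        "molecule": parts[4],
        "division": parts[5],
        "date": parts[6],
    }


locus = parse_locus(
    "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
)
division_text = DIVISIONS.get(locus["division"], "unknown division")
print(locus["name"], locus["length"], division_text)
```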
 The DEFINITION Line:
Syntax:
DEFINITION Homo sapiens myosin heavy mRNA, complete cds

 ACCESSION Line:
– Unique identifier for a sequence record
– Usually combination of a letter(s) and numbers (U12345 or
AF123456)

 VERSION Line:
Syntax: VERSION U49845.1 GI:1293613
VERSION: accession.version identifier. Any change to the sequence data (even a single base)
increases the version number, e.g., U12345.1 → U12345.2
GI: GenInfo Identifier, a number assigned to each sequence version; each protein translation
within a nucleotide sequence record receives its own GI
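
The versioning rule can be illustrated with a small sketch that splits a VERSION line into accession, version number and GI, then compares two versions of the same accession. The function names are invented for this example, and the GI part is treated as optional.

```python
def parse_version(line):
    """Return (accession, version, gi) from a VERSION line."""
    parts = line.split()
    if len(parts) < 2 or parts[0] != "VERSION":
        raise ValueError("not a VERSION line: " + line)
    accession, _, version = parts[1].partition(".")
    gi = None
    for token in parts[2:]:
        if token.startswith("GI:"):
            gi = int(token[3:])
    return accession, int(version), gi


def sequence_changed(old_line, new_line):
    """True if the newer VERSION line carries a higher version number."""
    old_acc, old_ver, _ = parse_version(old_line)
    new_acc, new_ver, _ = parse_version(new_line)
    if old_acc != new_acc:
        raise ValueError("different accessions")
    return new_ver > old_ver


parsed = parse_version("VERSION     U49845.1  GI:1293613")
changed = sequence_changed("VERSION     U12345.1", "VERSION     U12345.2")
print(parsed, changed)
```

The accession stays stable across updates, so it is the part to cite for a record in general. The accession.version pair pins down one exact sequence.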
 KEYWORDS Line: Word or phrase describing the sequence
 SOURCE Line: abbreviated form of the organism name
 ORGANISM: organism name
 REFERENCE
 Authors
 Journal
 Title
 PubMed

 FEATURES
Information about genes and gene products, as well as regions of biological
significance reported in the sequence. These can include regions of the sequence that
code for proteins and RNA molecules, as well as a number of other features
- Source
- Gene
- CDS
 ORIGIN
The sequence data begin on the line immediately below ORIGIN. To view/save the
sequence data only, display the record in FASTA format
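
As an illustration of what "sequence data only" means, the sketch below pulls the bases out of the ORIGIN block and writes them in FASTA format. The miniature record is made up (only its layout mirrors a GenBank flat file), single-line DEFINITION fields are assumed, and `genbank_to_fasta` is a hypothetical helper rather than part of any library.

```python
import textwrap

# A made-up miniature record; only the layout mirrors a GenBank flat file.
RECORD = """\
LOCUS       DEMO0001       30 bp    DNA             SYN       01-JAN-2000
DEFINITION  Hypothetical demonstration sequence.
ACCESSION   DEMO0001
ORIGIN
        1 acgtacgtac gtacgtacgt acgtacgtac
//
"""


def genbank_to_fasta(record, width=60):
    """Build a FASTA string from the ACCESSION, DEFINITION and ORIGIN parts."""
    accession = definition = ""
    bases = []
    in_origin = False
    for line in record.splitlines():
        if line.startswith("//"):            # end of record
            break
        if in_origin:
            # Drop the leading position number and the spaces between blocks.
            bases.append("".join(line.split()[1:]))
        elif line.startswith("ACCESSION"):
            accession = line.split()[1]
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
    sequence = "".join(bases)
    body = "\n".join(textwrap.wrap(sequence, width))
    return ">" + accession + " " + definition + "\n" + body + "\n"


fasta = genbank_to_fasta(RECORD)
print(fasta)
```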
EMBL entry: example
ID HSERPG standard; DNA; HUM; 3398 BP.
XX
AC X02158;
XX
SV X02158.1
XX
DT 13-JUN-1985 (Rel. 06, Created)
DT 22-JUN-1993 (Rel. 36, Last updated, Version 2)
XX
DE Human gene for erythropoietin
XX
KW erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS Homo sapiens (human)
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN [1]
RP 1-3398
RX MEDLINE; 85137899.
RA Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F., Kawakita M.,
RA Shimizu T., Miyake T.;
RT Isolation and characterization of genomic and cDNA clones of human
RT erythropoietin;
RL Nature 313:806-810(1985).
XX
DR GDB; 119110; EPO.
DR GDB; 119615; TIMP1.
DR SWISS-PROT; P01588; EPO_HUMAN.
XX

FT /organism=Homo sapiens
FT mRNA join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
FT CDS join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
FT /db_xref=SWISS-PROT:P01588
FT /product=erythropoietin
FT /protein_id=CAA26095.1
FT /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
FT AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
FT QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
FT TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
FT mat_peptide join(1262..1339,1596..1682,2294..2473,2608..2763)
FT /product=erythropoietin
FT sig_peptide join(615..627,1194..1261)
FT exon 397..627
FT /number=1
FT intron 628..1193
FT /number=1
FT exon 1194..1339
FT /number=2
FT intron 1340..1595
FT /number=2
FT exon 1596..1682
FT /number=3
FT intron 1683..2293
FT /number=3
FT exon 2294..2473
FT /number=4
FT intron 2474..2607
FT /number=4
FT exon 2608..3327
FT /note=3' untranslated region
FT /number=5
XX
SQ Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag 60
tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat 120
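
The `join(...)` locations in the feature table of the erythropoietin entry can be checked with a few lines of code. This is a sketch that handles only plain `a..b` ranges inside `join(...)`; other location operators such as `complement(...)` and fuzzy ends are outside its scope. The function names are assumptions made for illustration.

```python
import re


def parse_join(location):
    """Turn 'join(a..b,c..d)' or 'a..b' into a list of (start, end) pairs."""
    return [(int(a), int(b))
            for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def total_length(segments):
    """Positions are 1-based and inclusive: each segment has end-start+1 bases."""
    return sum(end - start + 1 for start, end in segments)


CDS = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"
segments = parse_join(CDS)
cds_length = total_length(segments)

# Introns are the gaps between consecutive coding segments.
introns = [(prev_end + 1, next_start - 1)
           for (_, prev_end), (next_start, _) in zip(segments, segments[1:])]

print(cds_length, cds_length // 3, introns)
```

The five segments add up to 582 bases, that is 194 codons: the 193 residues of the /translation qualifier plus a stop codon. The computed gaps coincide with the four intron features listed in the entry.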
Secondary Databases
 Sometimes known as pattern databases
 Contain results from the analysis of the sequences in the
primary databases
 Examples of secondary databases include:
• PROSITE
• Pfam
• BLOCKS
• PRINTS
• IDENTIFY
Pattern-recognition methods and the databases built on them:
• Single motif methods
– Fuzzy regex (IDENTIFY)
– Exact regex (PROSITE)
• Full domain alignment methods
– Profiles (PROFILE LIBRARY)
– HMMs (Pfam)
• Multiple motif methods
– Identity matrices (PRINTS)
– Weight matrices (BLOCKS)
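
The exact-regex approach can be made concrete by translating a PROSITE-style pattern into an ordinary regular expression. The sketch below covers only the common pattern elements (x, [..], {..}, repeat counts and the < > anchors) and is not the official PROSITE scanner. The pattern used is the N-glycosylation site pattern (PROSITE PS00001), and the test string is a fragment of the erythropoietin translation from the EMBL example entry.

```python
import re

ELEMENT = re.compile(
    r"(<?)"                              # optional N-terminal anchor
    r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})"    # residue, x, allowed set, forbidden set
    r"(?:\((\d+(?:,\d+)?)\))?"           # optional repeat: (n) or (n,m)
    r"(>?)"                              # optional C-terminal anchor
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        start, core, repeat, end = match.groups()
        if core == "x":
            core = "."                           # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"       # forbidden residues
        if repeat:
            core += "{" + repeat + "}"
        pieces.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(pieces)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
fragment = "AKEAENITTGCAEHCSLNENITVPDTK"
hits = [m.start() + 1 for m in re.finditer(regex, fragment)]   # 1-based starts
print(regex, hits)
```

An exact pattern like this either matches or does not, which is why short patterns tend to produce chance hits. The fuzzy, profile and HMM methods listed above trade that simplicity for tolerance of variation.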
