
Bioinformatics – An Overview
Dr. Virendrakumar (Virendra) C. Bhavsar

Professor
Dean 2003-2008
Director, Advanced Computational Research Lab. 2000-10
Faculty of Computer Science
University of New Brunswick (UNB)
Fredericton, Canada
Visiting Professor
Center for Development of Advanced Computing (C-DAC)
Pune, India
Outline
• Introduction – UNB, C-DAC, Bioinformatics
• Genome – Genes, Proteomes, Evolution
• Databases and Information Retrieval
• Sequence Alignment and Phylogenetic trees
• Protein Structure and Drug Discovery
• Proteomics and Systems Biology
• Infrastructure: UNB and C-DAC
• Research Work at the University of New Brunswick and C-DAC
• Future
University of New Brunswick (UNB)
Faculty of Computer Science

The First “Faculty of CS” in Canada
University of New Brunswick

Fredericton, New Brunswick
Canada
Oldest English Language University in Canada
Established in 1785
Fredericton and UNB

Center for Development of Advanced Computing (C-DAC)
India
History
1987
India requires a supercomputer for weather forecasting
Government of USA refuses sale of a supercomputer to India
The Government of India decides to launch a national initiative for development of indigenous supercomputers
C-DAC: HPC Evolution and Road Map
[Figure: timeline starting from technology denial and moving through PARAM 8000 (1991), PARAM 9000 (1994), PARAM 10000 (1998) and PARAM Padma (2002-03) to targets of 10 TF, 100 TF and 1 PF between 2007 and 2012-13. Side tracks: Garuda grid computing (proof of concept at 100 Mbps across 17 locations, then main phase), a viable HPC business computing environment, and social computing with a participatory approach as a platform for the user community to interact and collaborate]
C-DAC Centres
• Headquarter
– Pune
• Centres
– Pune
– Knowledge Park, Bangalore
– Electronics City, Bangalore
– Chennai
– Delhi
– Hyderabad
– Kolkata
– Mohali
– Mumbai
– Noida
– Thiruvananthapuram
Total manpower is 2100 across all the centres of C-DAC
C-DAC’s Thrust Areas
• High Performance Computing & Grid Computing

• Hardware, Software, Systems, Applications, Research, Technology, Infrastructure
• Multilingual Computing
• Tools, Fonts, Products, Solutions, Research, Technology Development
• Software Technologies
• OSS, Multimedia, ICT for masses, E-Governance, Geomatics
• Professional Electronics
• Digital Broadband, Wireless Systems, Network Technologies, Power Electronics, Real-Time
Systems, Embedded Systems, VLSI/ASIC Design, Agri Electronics
• Cyber Security & Cyber Forensics

• Cyber Security tools, technologies & solution development, Research & Training
• Health Informatics
• Hospital Information System, Telemedicine, Decision Support System
• Ubiquitous Computing
• RFID, Design, Development and Integration of Ubicomp System Components
• Education & Training

• e-Learning Technologies & Services
Compute Nodes
No. of Processors : 248 (Power 4 @ 1 GHz)
Aggregate Peak Computing : 1005 GFs (~1 TF)
File Servers
No. of Processors : 24 (UltraSparc-III@900MHz)
Aggregate Memory : 96 GigaBytes
Internal Storage : 0.4 TeraBytes
File System : QFS
Operating System : Solaris 8
Networks
Primary : PARAMNet-II @ 2.5 Gbps Full Duplex
Backup : Gigabit Ethernet @ 1 Gbps Full Duplex
Management : 10/100 Mbps Fast Ethernet
External Storage
Storage Array : 5 TeraBytes with 16 T3 disk arrays
Tape Library : 12 TeraBytes - L700 (5 LTO drives)
Software
HPCC - C-DAC’s High performance computing and communication software suite
Compilers, Parallel Libraries and Tools
Ranked 171 in 2nd quarter end and 258 as per the latest ranking
C-DAC
Advanced Computing Training School (ACTS)
ACTS @ a glance
• An outfit initiated by C-DAC R&D in 1993
• Began with a modest 20 students and has grown to over 5000 students
• Trained more than a quarter million students
• Grown from one city, one centre to 30 cities and 50 centres within India
• Over 150 crores of investment and 600-plus dedicated manpower
• Spread from India to international locations
• From one course to more than 10 courses
International Presence
Azerbaijan Saudi Arabia Belarus
Russia
Tajikistan
Uzbekistan
Turkmenistan
Mauritius
Ghana
Armenia
Myanmar
Tanzania
Seychelles
Lesotho
Post Graduate Diploma Programs
Post Graduate Courses

DAC : Diploma in Advanced Computing
DACA : Diploma in Advanced Computer Arts
DVLSI : Diploma in VLSI Design
WiMC : Diploma in Wireless & Mobile Computing
DSSD : Diploma in System Software Development
DGi : Diploma in Geo informatics
DISCS : Diploma in Information System & Cyber Security
DHI : Diploma in Healthcare Informatics
DLC : Diploma in Language Computing
DIVESD: Diploma in Integrated VLSI & Embedded System
Design
DESD : Diploma in Embedded Systems Design
DPC : Diploma in Parallel Computing
M.Tech. Programs
Computer Science & Engineering

Software Engineering
Information Technology
VLSI
Artificial Intelligence
Grid Computing & Storage Management
Embedded Systems Design
Wireless & Network Technology
Process Control & Instrumentation
Training Programmes UNDER Tech sangam
Bioinformatics
Definitions
Bioinformatics
The creation and development of advanced information and computational techniques for solving problems in biology
High Performance Computing (HPC)
Hardware and software for high-speed computations and large storage
“Bio” Introduction
Molecular Biology
Living organisms (on Earth)
Lipids – Separate inside from outside
Proteins – Build 3D machinery to perform biological functions
DNA – Stores information on how to build the machinery
Diagram of a cell
Lipid membranes - provide barrier
Protein structures - do work
DNA nucleus - store info
Molecular Biology
Deoxyribonucleic Acid (DNA)
Composition
- Sequence of nucleotides
- Nucleotide = deoxyribose sugar + phosphate group + base
Molecular Biology - DNA
DNA: contains genetic instructions used in the development and functioning of all known living organisms with the exception of some viruses.
DNA molecules: long-term storage of information.
DNA: a set of blueprints, like a recipe or a code, since it
contains the instructions needed to construct other
components of cells, such as proteins and RNA
molecules.
Genes: The DNA segments that contain instructions to
construct the above components of cells
Other DNA sequences: structural purposes, or are
involved in regulating the use of this genetic information.
Chemically, DNA consists of:
- Two long polymers of simple units called nucleotides, with backbones made of sugars and phosphate groups joined by ester bonds.
- These two strands run in opposite directions to each other and are therefore anti-parallel.
- Attached to each sugar is one of four types of molecules called bases. It is the sequence of these four bases along the backbone that encodes information. This information is read using the genetic code, which specifies the sequence of the amino acids within proteins.
- The code is read by copying stretches of DNA into the related nucleic acid RNA, in a process called transcription.
- Within cells, DNA is organized into long structures called chromosomes.
- Chromosomes are duplicated before cells divide, in a process called DNA replication.
- Eukaryotic organisms (animals, plants, fungi, and protists) store most of their DNA inside the cell nucleus and some of their DNA in organelles, such as mitochondria or chloroplasts.
- Prokaryotes (bacteria and archaea) store their DNA only in the cytoplasm.
Molecular Biology
RNA: Ribonucleic acid
- A long chain of nucleotide units
- Each nucleotide consists of a nitrogenous base, a
ribose sugar, and a phosphate
RNA is very similar to DNA
RNA is usually single-stranded
DNA is usually double-stranded
RNA nucleotides contain ribose while DNA contains
deoxyribose (a type of ribose that lacks one oxygen
atom)
RNA has the base uracil rather than thymine that is
present in DNA
Molecular Biology
DNA: DNA → DNA (Replication)
RNA: DNA → RNA (Transcription / Gene
Expression)
Protein: RNA → Protein (Translation)
DNA, RNA, Proteins

Proteins and nucleic acids (DNA, RNA) are essential
components for living organisms
DNA (gene) → Transcription → RNA → Translation → Proteins
[Figure: a chromosome unwinds into DNA carrying Gene 1, Gene 2, ...]
Raw biological data: nucleic acids (DNA)
Raw biological data: amino acid residues (proteins)
Standard Genetic Code
(rows: first base; columns: second base; Ter = stop codon)

      T              C              A              G
T     TTT Phe (F)    TCT Ser (S)    TAT Tyr (Y)    TGT Cys (C)
      TTC Phe (F)    TCC Ser (S)    TAC Tyr (Y)    TGC Cys (C)
      TTA Leu (L)    TCA Ser (S)    TAA Ter        TGA Ter
      TTG Leu (L)    TCG Ser (S)    TAG Ter        TGG Trp (W)
C     CTT Leu (L)    CCT Pro (P)    CAT His (H)    CGT Arg (R)
      CTC Leu (L)    CCC Pro (P)    CAC His (H)    CGC Arg (R)
      CTA Leu (L)    CCA Pro (P)    CAA Gln (Q)    CGA Arg (R)
      CTG Leu (L)    CCG Pro (P)    CAG Gln (Q)    CGG Arg (R)
A     ATT Ile (I)    ACT Thr (T)    AAT Asn (N)    AGT Ser (S)
      ATC Ile (I)    ACC Thr (T)    AAC Asn (N)    AGC Ser (S)
      ATA Ile (I)    ACA Thr (T)    AAA Lys (K)    AGA Arg (R)
      ATG Met (M)    ACG Thr (T)    AAG Lys (K)    AGG Arg (R)
G     GTT Val (V)    GCT Ala (A)    GAT Asp (D)    GGT Gly (G)
      GTC Val (V)    GCC Ala (A)    GAC Asp (D)    GGC Gly (G)
      GTA Val (V)    GCA Ala (A)    GAA Glu (E)    GGA Gly (G)
      GTG Val (V)    GCG Ala (A)    GAG Glu (E)    GGG Gly (G)

Triplets of DNA bases called ‘codons’ each code for an amino acid
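The codon lookup can be sketched in a few lines of Python. This is only an illustration of reading the standard genetic code three bases at a time; the names `CODON_TABLE` and `translate` are made up for this example, and real pipelines would normally rely on an established library instead.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position
# ("*" marks a stop codon, shown as Ter in the table).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence codon by codon, stopping at a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCGATGAGTCGTGTAGCTTCT"))
```

The function reads only one reading frame, starting at the first base; choosing the frame is a separate problem.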
Protein Structure
Protein 3D structure
http://anatomy.med.unsw.edu.au/cbl/research/cytoskeleton/swissprotactin.htm
The structure of the protein determines its functionality
“Informatics”
FASTA formatted Sequences
FASTA: "FAST-All" alignment; it works with any alphabet, an extension of "FAST-P" (protein) and "FAST-N" (nucleotide) alignment.
Sample FASTA formatted Sequences
EST sequence (A, C, G, T)
>gi|39796586|gb|CK247430.1|CK247430 EST731067 potato callus cDNA library,
mRNA sequence
ACAAGTCACTATAGGGACATGCTTCAATTTTTTCAAAACATCTTGAATAGTACAAAGTGCACAACATACT
CCAAAAAACTGAATACATTTTCTATTGTCAATATCTATAGCCATATGACTTTCAGTGCGACCTATGCATT
CATAACTCCCGCTACCAAATCCACCATGTAGTGCTTACAACAACAAGCCTAGTGAGAACGTAAGCCTGGT
CTGGAGCCAAAAGCAAATTATGTATACTAAAAAACCCCCTGGCTAAAATGCATATCATGATTAGTAGTGA
CATT
Protein Sequence (20 different amino acids)

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPF
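Reading such files is straightforward: a line starting with ">" opens a new record and the following lines are its sequence. The sketch below is a minimal parser under that assumption; `parse_fasta` and the two short records are made up for illustration, and the header is assumed to sit on a single line.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping from header text (without '>') to the joined sequence."""
    records = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Two short, made-up records in FASTA layout.
example = """>est1 hypothetical EST
ACAAGTCACTATAGGG
ACATGC
>prot1 hypothetical protein
ELRLRYCAPAGF
"""
records = parse_fasta(example.splitlines())
for name, sequence in records.items():
    print(name, len(sequence))
```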
Biological Databases
Genome databases – flat files or relational database
GenBank, EMBL, DDBJ, PDB, SWISSPROT, PIR
Classification of Biological databases:

- primary databases (GenBank, EMBL, DDBJ)
- secondary databases (SWISSPROT, PDB, PIR)
Biological databases
• Like any other database
• Data organization for optimal analysis
• Data is of different types
  – Raw data (DNA, RNA, protein sequences)
  – Curated data (DNA, RNA and protein annotated sequences and structures, expression data)
Biological databases - Examples
• Nucleotide Databases
Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome, MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations, IMGT
• Genome Databases
Human, Mouse, Yeast, C.elegans, FLYBASE, Parasites
• Protein Databases
Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT
• Structure Databases
PDB, MSD, FSSP, DALI
• Microarray Database
ArrayExpress
• Literature Databases
MEDLINE, Software Biocatalog, Flybase Archives
• Alignment Databases
BAliBASE, Homstrad, FSSP
PDB – Protein Data Bank
• 3D macromolecular structural data
• Data originates from NMR or X-ray crystallography techniques
• If the 3D structure of a protein is solved ... they have it
What to take home
• Databases are a collection of data
• Need to access and maintain easily and flexibly
• Biological information is vast and sometimes very redundant
• Distributed databases bring it all together with quality controls, cross-referencing and standardization
• Computers can only create data, they do not give answers
“Bioinformatics”
Premise of Bioinformatics
Gene sequences determine biological function
Genomic DNA → Amino acids → Proteins → Function
Similar composition → similar function?

- DNA sequences
- Amino acid sequences
- Protein 3-D structure
Predicting protein function
- Designer drugs
- Personalized treatments
Bioinformatics
Determining protein function
Hard way
-Biological / chemical analyses
- Determine 3D structure w/ x-ray crystallography, NMR
Easy way?
- Sequence protein / DNA → find close match in database
- Guess function based on match
- Validate guess in lab
Bioinformatics is imprecise
- Similar to data-mining
- Only suggests possible relationships
- Must validate correlation → causation
Growth of Bioinformatics
1970’s
- DNA sequencing
- Alignment w/ dynamic programming (Needleman-Wunsch, 1970; Smith-Waterman followed in 1981)
1980’s
- Sequence databases (EMBL, GenBank)
- Alignment w/ FASTA (linked lists, hashing)
1990’s
- Automatic DNA sequencing
- Alignment w/ BLAST (neighborhood words, probabilities)
- Internet & WWW
Now
- Genomics, Proteomics
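The dynamic-programming idea behind Smith-Waterman local alignment can be sketched as follows. This is a teaching-sized version with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for the example, not values from the talk.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; a cell never drops below zero, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGGACGGGG"))
```

The table has one cell per pair of positions, so time and memory grow with the product of the two sequence lengths. That cost is the reason heuristic tools such as FASTA and BLAST were developed for database-scale searches.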
Bioinformatics Topics
Sequence alignments
- Find similarity between DNA / protein (amino acid) sequences
Genome assembly
- Combining genomic fragments to form whole genome
Gene identification & annotation
- Identify and classify genes on the genome
Microarrays & gene expression analysis
- Use DNA microarray (gene chip) to measure mRNA
Protein folding
- Compute 3-D protein structure ↔ protein sequence
Phylogenetic analysis
- Find genetic relationships between sequences / species
What Does Genomics Mean?

• “Genomics”: a science that studies the genetic
material of a species at the molecular level
• A scientific approach to identify and define the
function of genes, as well as uncover when and how
genes work together to produce traits
• “Structural Genomics” approaches (mapping) -
focus on traits controlled by one or a few genes, and
often only provide information regarding the
location of a gene or genes
• Examine the interrelationships and interactions
between thousands of genes
How do we do this?
Genome Organization
Chromosome
DNA
Leaf Tuber
Genome Organization
Proteins are building blocks for living organisms
Proteins are derived from DNA
Transcription – the message (RNA) that codes for a protein is formed from the gene's DNA
Translation – RNA triplets (codons) code for amino acids
A gene can also be identified from its complementary DNA (cDNA); sequences read from active (expressed) genes are termed Expressed Sequence Tags (ESTs)
Chromosome
DNA
DNA
Gene 1 Gene 2 . . . .
Genome Organization
DNA
Gene 1 Gene 2 Etc.
....TATACAGCAAAATAGAAAGATCTAGTGTCCCATGGCGATGAGTCGTGTAGCTTCT….
Promoter Coding ORF

“Switch” “Message”
cDNA Collections (Libraries)
• Various tissues are collected from the plant, and messages are extracted (leaf messages, tuber messages)
• The messages are “copied” to form double-stranded DNA copies (cDNA) of each message (leaf cDNA, tuber cDNA)
• Each copy is “glued” into a piece of bacterial DNA for easier storage, handling and propagation, resulting in a collection or “library” of cDNAs for each tissue
• The cDNAs are then read or “sequenced”, to give the order of A’s, C’s, G’s or T’s for each
• We are left with the sequence of each gene that is active (expressed) in each cell, tissue or organ studied
• These are “Expressed Sequence Tags” or ESTs
• Using complex computer resources, these ESTs can be analyzed and compared with known sequences and proteins
• Look for messages associated with specific organs or characteristics/traits
Take Home Points
• Messages from various genes are important, as they dictate which proteins are produced
• Promoters are also important, as they dictate where a specific message and protein is produced
• “Genomics” involves the study of all of the messages produced by the various plant cells
• A lot of information needs to be organized and analyzed
Database
• Contains all the EST sequences
• Contains useful annotations
• Blast Searches
• Contig Assemblies
• Transmembrane Spanning Regions
• Gel Pictures
• EST Information
Data Analysis
• Tens of thousands of ESTs available for study

• Most methods to study message distributions are
low throughput AND time consuming
• “Genomics” necessitates the large scale study of
gene expression
How can we do this?
Microarray Analysis
Microarray Analysis - Processing
Image Processing → Data Normalization → Analysis (Differential Gene Expression, Cluster Analysis, Pathway Analysis)
[Figure: Intensity Dependence Comparison. Log(R/G) plotted against 0.5*(Log(G) + Log(R)) for Slide3 and Slide70, each with a polynomial fit; the fits report R2 = 0.6185 and R2 = 0.2014]
Spot quality: signal and background
• Irregular size or shape
• Irregular placement
• Low intensity
• Saturation
• Spot variance
• Background variance
[Figure: example spots labelled indistinguishable, saturated, bad print, misalignment, artifact]
• Calculate numeric characteristics of each spot
• Throw out spots that do not meet minimum requirements for each characteristic
• Throw out spots that do not have minimum overall combined quality
Microarray Analysis - Data Normalization
• Normalize data to correct for variances
  – Dye bias
  – Location bias
  – Intensity bias
  – Pin bias
  – Slide bias
  – Control vs. non-control spots
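As a rough illustration of the simplest of these corrections, the sketch below computes the two quantities used in the intensity plot, Log(R/G) and 0.5*(Log(G) + Log(R)), and then removes a global dye bias by subtracting the median log ratio. The function names and the sample intensities are made up; base-2 logarithms are an assumption. Intensity-dependent bias would additionally need a fitted curve, such as the polynomial fits mentioned for the plot, which is not shown here.

```python
import math
from statistics import median


def ma_values(red, green):
    """Return (log ratios, average log intensities) for paired channel readings."""
    log_ratio = [math.log2(r / g) for r, g in zip(red, green)]
    mean_log = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return log_ratio, mean_log


def median_normalize(log_ratio):
    """Centre log ratios on zero, assuming most spots are not differentially expressed."""
    centre = median(log_ratio)
    return [value - centre for value in log_ratio]


# Made-up intensities in which the red channel reads about twice as bright overall.
red = [200.0, 400.0, 1600.0]
green = [100.0, 200.0, 400.0]
log_ratio, mean_log = ma_values(red, green)
normalized = median_normalize(log_ratio)
print(log_ratio, normalized)
```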
Microarray Analysis - Clustering
• Cluster genes based on expression profiles
• Gene expression across several treatments
• Hypothesis: Genes with similar function have similar expression profiles
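One simple way to group genes by profile similarity is to compare profiles with the Pearson correlation and place a gene into the first cluster whose seed profile it matches closely. The sketch below does exactly that; it is a deliberately naive stand-in for the hierarchical or k-means methods usually applied, and the 0.9 threshold, gene names and values are all made up.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def cluster_profiles(profiles, threshold=0.9):
    """Greedy clustering: the first gene placed in a cluster acts as its seed."""
    clusters = []
    for gene, profile in profiles.items():
        for cluster in clusters:
            if pearson(profiles[cluster[0]], profile) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters


# Expression of four hypothetical genes across four treatments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.0, 1.0],
}
print(cluster_profiles(profiles))
```

The result depends on the order in which genes are visited, which is one reason more robust clustering methods are preferred in practice.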
Expression Profile Clustering

Microarray Analysis - Data Management
Project
Database
Engine
Information Processing and Handling
• Assembly and annotation of genomic data
• EST analysis and databases
• Cluster analysis of microarray data
• Comparisons of various transcriptomic methods
• Integration of sequence, transcriptomic, proteomic, metabolomic, transgenic data
Research Problems in Bioinformatics
Find genomes of all organisms
Identify and annotate all genes
Compute sequence <-> 3D structure for all proteins
Compare DNA / protein sequences for similarity
Compare families of DNA / protein sequences
Reason to be optimistic: Biology is finite…

~30,000 human genes; ~1000 protein superfamilies
…but computer speeds keep increasing
Fighting Bird Flu Virus in 3-D
Bioinformatics Infrastructure – High Performance Computing
Advances in Microprocessor Technology
1974 – 1 MHz clock
1988 – 40 MHz
2002 – 2 GHz
2009 – P4 3.0 GHz, Quadcore 2.66 GHz
Intel Montecito chip: 1.72 billion transistors
NVidia 280 series GPU: 1.4 billion transistors
- Circuit complexity doubles every 18 months → computing power at a given cost doubles every 18 months
- Processor clock rates: 40% increase/year + more instr./cycle
- DRAM access times: 10% improvement/year → caches required
Current Supercomputer – Nov 2009
Jaguar
Oak Ridge National Lab., USA
- 1.72 Petaflop/s on the Linpack benchmark (1 Petaflop/s = one quadrillion, i.e. a million billion (10**15), floating-point operations/sec)
- 2.332 Petaflop/s peak (i.e. 2332 Teraflop/s)
- Power – 1750 Watt/sq ft; ~50 million kWh per year
- Space – 4352 square feet, close to the size of an NBA basketball court
Future
• IBM Cyclops64 – supercomputer on a chip
• C-DAC initiative for 2010 – petaflop machine
• NCSA, USA 2011 – petaflop machine
• NASA, SGI and Intel Pleiades – 10 petaflop by 2012
• 1 Exaflop (10**18 flops) by 2019
• Human brain neural simulations – 10 exaflop by 2025
• 2-week full weather modeling – 1 zettaflop (10**21 flops) by 2030
High Performance Computing and Networking

@
Advanced Computational Research Lab
(ACRL) Infrastructure
ACEnet: Atlantic Computational Excellence Network
“People, Research, Excellence”
Hosting sites:
Member sites:
ACEnet
• Atlantic Canada is a distributed environment
• $30 million initiative
• Waterways make networking solutions difficult (e.g. Cabot Strait)
ACEnet
• World-class HPC facilities
• Behave as a single, regionally distributed “computational power grid”
• Create and operate sophisticated collaboration facilities to bind together geographically dispersed research communities.
ACEnet at UNB
Fundy: SUN cluster, AMD Opteron, 632 cores

ACEnet: 3324 cores
Internet connectivity > 2Gbps at UNB
Collaboration Grid
• Collaboration gear across Atlantic Canada
• Lecture rooms equipped so ACEnet sites can share seminars and participate remotely
• ACEnet cafés at each site sharing continuous video feeds
• Desktop level collaboration equipment for personal communication
• Access Grid streams tens to hundreds of Mbps across the CANARIE network
ACEnet
Bioinformatics Research
@
The Canadian Potato Genome Project
Collaborators
Dr. Patricia Evans (UNB), Dr. Barry Flinn (BioAtlantech), Dr. David Dekoyer (Potato Research Center), Carleton University, Nova Scotia Agricultural College
Students: Aijazuddin Syed (MCS Student), En Zhang (MCS Student), Zheng Wang (MCS Student), Marc Cooper (MCS Student), Rachita Sharma (PhD Student)
Potato
Integral part of diet – French fries, mashed potatoes
Provides at least 12 essential vitamins and minerals
Fourth most important crop worldwide
Potato has not been explored in terms of functional and biochemical traits
Much is still unknown about the control of potato development and processing/quality traits (disease resistance, stress tolerance, carbohydrate metabolism, tuber shape)
Economic Importance Of The Potato
• Integral part of the diet of a large proportion of the world’s population
• Supplies at least 12 essential vitamins and minerals
• Still much unknown regarding the control of potato development and processing/quality traits (i.e. disease resistance, stress tolerance, carbohydrate metabolism, tuber shape)
The Canadian Potato Genome Project (CPGP)
46% of national potato production $1 Billion/year

Home of McCain Foods Ltd. $5.5 billion/year
Potato Research Center (PRC) of AAFC
Solanum Genomics International Inc./BioAtlantech
Carleton University
Nova Scotia Agricultural College (NSAC)
CPGP Goals
CPGP targets genes associated with tuber health and tuber quality:
Tuber Health – Late Blight and Common Scab
Tuber Quality – Stable dry matter accumulation, cold sweetening and after-cooking darkening
[Figure: leaf and tuber tissue; DNA with Gene 1, Gene 2, ...]
Project Description
Identification Of A Differential Gene Expression Pattern And Genes Related To Resistance In Potato Late Blight
• One of the most devastating diseases of potato worldwide
• If left unmanaged, complete destruction of crops can occur
• Attacks leaves and tubers; large necrotic lesions on leaves and dry rot that spreads through tubers; secondary bacteria and fungi often infect through late blight lesions
Late Blight Project
• Collaborative effort with AAFC Potato Research Centre

• Population of blight-sensitive and blight-resistant plants
of near isogenicity
• cDNA libraries made from leaves of a blight-sensitive and
a blight resistant plant
• 2500 messages were sequenced from each library
(5000 total ESTs)
• Different ESTs to be profiled for expression
• The tremendous amounts of data generated will need to be
managed efficiently
Database - Sequence Info

Late Blight Project
cDNA Microarray Using SGII Clones
• hybridized with Cy3 (resistant) + Cy5 (susceptible) probes

(reciprocal labelling experiments)
ANDLBRLF02345HTF.01 - Class II chitinase
ANDLBRLF01256HTF.01 - Pathogenesis-related protein

P23 precursor
ANDLBRLF02041HTF.01 - Unknown protein
What Use Is All Of This Information?
• Transgenics:
- Enhance tuber quality, processing traits, disease
resistance, stress tolerance more rapidly than breeding
• Expression Assisted Selection:
- Obtain expression profiles for thousands of genes
associated with specific traits or characteristics
- Use these profiles as a baseline to compare with
the expression profiles of unknown clones; crosses
• New Protein Products :
- Identify genes encoding secreted proteins/ligands
- Test these for growth-promoting/other effects
- Express genes in batch cultures and purify proteins
Example Of Gene Use
GA-20 oxidase in potato:
GFP expression in tobacco cells
• GA-20 oxidase
knockouts with
enhanced tuber
production
• GA-20 oxidase
knockouts with
reduced tuber
sprouting
The Canadian Potato Genome Project
[Workflow diagram]
• Leaf and tuber cDNA: sequence the gene and build cDNA libraries [Solanum Genomics Intl. Inc (SGII)]
• EST sequence generation [National Research Council at Halifax and SGII] → FASTA formatted EST sequence & trace files
• Microarray profiling [SGII, PRC, UNB, Ontario Cancer Institute, and NSAC]
• Bioinformatics: base-calling, clustering, BLAST, annotations, and gene expression [UNB and PRC]
Sample FASTA formatted Sequences
EST sequence
>gi|39796586|gb|CK247430.1|CK247430 EST731067 potato callus cDNA library,
mRNA sequence
ACAAGTCACTATAGGGACATGCTTCAATTTTTTCAAAACATCTTGAATAGTACAAAGTGCACAACATACT
CCAAAAAACTGAATACATTTTCTATTGTCAATATCTATAGCCATATGACTTTCAGTGCGACCTATGCATT
CATAACTCCCGCTACCAAATCCACCATGTAGTGCTTACAACAACAAGCCTAGTGAGAACGTAAGCCTGGT
CTGGAGCCAAAAGCAAATTATGTATACTAAAAAACCCCCTGGCTAAAATGCATATCATGATTAGTAGTGA
CATT
Protein Sequence
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
Database
Contains all the EST’s sequences
Contains useful annotations

Blast Searches
Contig Assemblies
Transmembrane Spanning Regions
Gel Pictures
EST Information
Data Analysis - Bioinformatics
Tens of thousands of ESTs available for study

Most methods to study message distributions are low
throughput AND time consuming
“Genomics” necessitates the large scale study of gene
expression
Automation required for routine processes

Data acquisition for potato genome annotation
Automated protein classification with rule maintenance
Use agents to integrate the software and primary databases in
a flexible and robust way
Overview of Bioinformatics Research at UNB
[Diagram of components: EST sequences; TraceScan; Automated Data Acquisition Pipeline (homologs, motifs, fingerprints, transmembrane and signal sites); Multi-Agent System for Potato Genome Annotation; Automated Protein Classification and Rule Maintenance]
TraceScan - Keywords
Chromatogram - visual representation of the digital output produced by an automated sequencing machine. A chromatogram is drawn as a set of four overlapping waveforms, one for each nucleotide base
Base-calling - determining the set of nucleotide bases for a DNA sequence strand from the analysis of the digital output produced by a sequencing machine
Heterozygosity - shows up in the chromatogram where a second strong peak appears beneath a primary peak. This may indicate the presence of a secondary nucleotide base at that location in the sequence
BLAST – Basic Local Alignment Search Tool
Example of a Chromatogram
The TraceScan Software System
Designed to investigate sequence quality, potential polymorphisms, and base heterozygosity in EST sequences.
Relies on the combined analysis of a DNA sequence trace file, the trace chromatogram, and multiple alignment of sequence homologs.
Allows base-calls to be substituted where superimposed peaks have been detected in the trace.
Base-calls deemed in error can be corrected to improve sequence quality and data reliability.
TraceScan
Visualizes DNA sequence chromatograms
Detects overlapping trace peaks using modifications to the PHRED base-caller
Peaks are highlighted on the user interface.
Modifications to PHRED enable base-calls with overlapping peaks to be substituted.
Base substitutions produce a new set of base quality scores for the sequence.
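The idea of flagging overlapping peaks can be illustrated with a toy check: at each called position, compare the two strongest channel intensities and flag the position when the second is a large fraction of the first. This is not PHRED's algorithm or TraceScan's actual modification; the function name, the 0.5 ratio and the intensities are assumptions made up for the example.

```python
def find_secondary_peaks(peak_heights, min_ratio=0.5):
    """Return (position, primary base, secondary base) where a second strong peak exists.

    peak_heights holds one dict per called position, mapping each base
    to the trace intensity of its channel at that position.
    """
    flagged = []
    for position, heights in enumerate(peak_heights):
        ranked = sorted(heights.items(), key=lambda item: item[1], reverse=True)
        (primary, top), (secondary, second) = ranked[0], ranked[1]
        if top > 0 and second / top >= min_ratio:
            flagged.append((position, primary, secondary))
    return flagged


# Made-up intensities for three called positions.
peaks = [
    {"A": 900, "C": 40, "G": 30, "T": 20},
    {"A": 50, "C": 800, "G": 30, "T": 520},  # strong T peak beneath the C peak
    {"A": 10, "C": 20, "G": 700, "T": 15},
]
print(find_secondary_peaks(peaks))
```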
TraceScan
An interface to NCBI BLAST provides sequence comparison capabilities.
Sequences are compared using BLASTN and BLASTX.
BLASTN alignments are analyzed in search of discrepancies that may identify base-calling errors or putative polymorphisms in the trace sequence.
Reading frames from BLASTX results are analyzed to examine if substituted base-calls result in synonymous or non-synonymous codon substitutions.
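Whether a substituted base-call is synonymous comes down to a codon lookup before and after the change. The sketch below shows that check for a single codon whose reading frame is already known; `classify_substitution` is a name made up for this example and does not describe how TraceScan itself is implemented.

```python
from itertools import product

# Standard genetic code in T, C, A, G order ("*" marks a stop codon).
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(
        product("TCAG", repeat=3),
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    )
}


def classify_substitution(codon, position, new_base):
    """Replace one base (position 0, 1 or 2) and report the effect on the amino acid."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    kind = "synonymous" if before == after else "non-synonymous"
    return mutated, before, after, kind


print(classify_substitution("CTT", 2, "C"))  # Leu stays Leu
print(classify_substitution("GAT", 2, "A"))  # Asp becomes Glu
```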
TraceScan System Architecture

The Automated Data Acquisition Pipeline (ADAP) - Keywords
Hypothetical Protein: The protein sequence obtained by conceptual transcription and translation of the DNA sequence. It is hypothetical because we do not know whether it is the protein the DNA actually codes for.
Homologs: Evolutionarily related protein sequences
Comparative genomics: A technique where the functional traits of a protein sequence are learnt from its homologs
Motifs: Highly conserved regions of protein sequences
Fingerprints: Collection of motifs
BLASTP: Basic Local Alignment Search Tool for protein-to-protein searches
Automated Data Acquisition
Pipeline (ADAP)
Gathers data for genome annotation
ADAP features:
Uses comparative genomics to learn from the Homologs
New variant of BLAST, Parameter Regulated Iterative BLAST
(PRI-BLAST)
Uses 7 different analysis/search tools
A few software design patterns are used
Perl, MySQL, Perl-DBI, BioPerl, EMBOSS, BLASTP, SGE 5.3,
and Perl-Gtk on Linux
ADAP Overview
[Diagram; the legend distinguishes data flow from database interactions]
Input: FASTA formatted EST sequences
Phase 1: Hypothetical protein (HP) extraction and homolog generation
Phase 2: Sequence based protein structure prediction
Phase 3: Database search based protein family prediction
Homologs and HPs are stored in the Potato ADAP database through the Perl-MySQL database interface
Parameter Regulated Iterative BLAST (PRI-BLAST)
A static set of BLASTP parameters (neighborhood score, E-value, fraction identical, BLOSUM matrix etc.) is not suitable, since proteins evolve at different rates
PRI-BLAST iteratively performs BLASTP on the query sequence and categorizes the query as
a Celebrity query (C) – many homologs
an Average query (A) – a few or no homologs
an Obscured query (O) – some homologs
PRI-BLAST
Rule module
Decides which set of BLASTP parameters to use
Halts the PRI-BLAST
Statistical module
Density of homologs is computed through SQL statements
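The general shape of such a loop can be sketched as follows: run the search with one parameter set, count the homologs, and either halt or move on to a more permissive set. Everything here is hypothetical, including the function names, the parameter values, the halting threshold and the `fake_search` stand-in for a BLASTP run; the actual PRI-BLAST rules and SQL-based statistics are not reproduced.

```python
def iterative_search(query, search, parameter_sets, min_homologs=3):
    """Try parameter sets in order and stop once enough homologs are found."""
    if not parameter_sets:
        raise ValueError("at least one parameter set is required")
    result = None
    for round_number, params in enumerate(parameter_sets, start=1):
        hits = search(query, **params)  # stand-in for the statistical module
        result = {"rounds": round_number, "parameters": params, "homologs": hits}
        if len(hits) >= min_homologs:   # stand-in for the rule module's halting decision
            break
    return result


def fake_search(query, evalue, matrix):
    """Pretend BLASTP run: a looser E-value cutoff returns more of a fixed hit pool."""
    pool = [("hit1", 1e-50), ("hit2", 1e-20), ("hit3", 1e-4), ("hit4", 0.5)]
    return [name for name, hit_evalue in pool if hit_evalue <= evalue]


# Strict to permissive; the values are illustrative only.
parameter_sets = [
    {"evalue": 1e-30, "matrix": "BLOSUM80"},
    {"evalue": 1e-10, "matrix": "BLOSUM62"},
    {"evalue": 1.0, "matrix": "BLOSUM45"},
]
result = iterative_search("MKVLAA", fake_search, parameter_sets)
print(result["rounds"], result["homologs"])
```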
Example BLASTP report

BLASTP 2.2.8 [Jan-05-2004]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer
. . . . . Nucleic Acids Res. 25:3389-3402.
Query= CK00043.5prime
(182 letters)
Database: All non-redundant GenBank CDS
translations+PDB+SwissProt+PIR+PRF excluding environmental samples
1,795,144 sequences; 592,604,613 total letters
Searching..................................................done
Score E
Sequences producing significant alignments: (bits) Value
gb|AAD46849.2| LD03471p [Drosophila melanogaster] 329 5e-90
ref|NP_651977.1| CG6773-PA [Drosophila melanogaster] >gi|7300991... 285 1e-76
ref|XP_312881.1| ENSANGP00000014751 [Anopheles gambiae] >gi|2129... 209 7e-54
gb|AAH54585.1| Unknown (protein for MGC:63980) [Danio rerio] 184 4e-46
.
.
.
>gb|AAD46849.2| LD03471p [Drosophila melanogaster] Length = 386
Score = 329 bits (1155), Expect = 5e-90
Identities = 181/182 (99%), Positives = 181/182 (99%)
Query: 1 VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG 60
VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG
Sbjct: 6 VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG 65
Query: 61 SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK 120
SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK
Sbjct: 66 SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK 125
Query: 121 LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPNXHTIGVN 180
LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPN HTIGVN
Sbjct: 126 LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPNAHTIGVN 185
Query: 181 AI 182
AI
Sbjct: 186 AI 187
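To feed a report like the one above into a database, the summary lines and the identity line have to be turned into numbers. The following is a small sketch under the assumption that each summary line ends with the bit score and the E-value, as in this example; the function names are illustrative and not part of BLAST or BioPerl.

```python
import re


def parse_hit_line(line):
    """Split one 'Sequences producing significant alignments' line into fields."""
    rest, bits, evalue = line.rsplit(None, 2)
    seq_id, _, description = rest.partition(" ")
    if evalue.startswith("e"):
        # BLAST sometimes prints very small E-values as 'e-100'
        evalue = "1" + evalue
    return seq_id, description.strip(), float(bits), float(evalue)


def identity_fraction(line):
    """Extract the identities ratio from an 'Identities = a/b (..%)' line."""
    match = re.search(r"Identities = (\d+)/(\d+)", line)
    if match is None:
        raise ValueError("no identities field in line")
    return int(match.group(1)) / int(match.group(2))


hit = parse_hit_line("gb|AAD46849.2| LD03471p [Drosophila melanogaster] 329 5e-90")
fraction = identity_fraction("Identities = 181/182 (99%), Positives = 181/182 (99%)")
```

Here `hit` holds the identifier, description, bit score and E-value of the top hit, and `fraction` is 181/182, the share of aligned positions that are identical.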
Motif Search Based Protein Sequence Analysis (mPSA)
Motifs are conserved regions of protein sequences, and a fingerprint is a collection of motifs in some order
mPSA (Phases 2 & 3) for the ADAP contains 6 mPSA tools from EMBOSS
Phase 2: sequence based mPSA
• secondary structure: transmembrane regions (Tmap), signal sites (Sigcleave), and general secondary structure (Garnier)
• super secondary structure: DNA binding sites (Helixturnhelix)
Phase 3: database search based mPSA
• protein motifs from PROSITE (Patmatmotifs) and protein fingerprints from PRINTS (Pscan)
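A motif search of the PROSITE kind boils down to pattern matching over the protein sequence. The sketch below is not how Patmatmotifs is implemented; it only illustrates the idea by translating a pattern written in PROSITE syntax into a Python regular expression. The pattern and the test sequence are made up for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate PROSITE pattern syntax into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} = any residue but P
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) = 2 to 4 repeats
        element = element.replace("x", ".")                     # x = any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence start / end
        parts.append(element)
    return "".join(parts)


def find_motifs(pattern, sequence):
    """Return (0-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# Illustrative pattern and sequence (not taken from the slides).
PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."
SEQUENCE = "AA" + "CAAC" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "AA"
matches = find_motifs(PATTERN, SEQUENCE)
```

A fingerprint search in the sense used here would then require several such motifs to occur in a given order within the same sequence.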
[Figure: Homologues for Various Ranges of Lengths of Hyp. Proteins. X-axis: Length of Hyp. Protein (10 - 15 to 110 - 115); Y-axis: Number of Homologues (Total), 0 to 10000.]
Shorter protein sequences have more homologs – they can be false positives
[Figure: Homologues with E<1 and E<100 for Various Ranges of Lengths of Hyp. Proteins. X-axis: Length of Hyp. Protein (10 - 15 to 110 - 115); Y-axis: Percentage of Homologues w.r.t. Total No. of Homologues; series: Percentage Homologues E<1 and Percentage Homologues E<100.]
Shorter sequences have a large E-value, hence we cannot use them in comparative genomics
[Figure: Fraction of Query Protein Conserved: Homologs (E<1 & Length>35) vs. Homologs (E<1, Length>35, and have a Fingerprint). X-axis: Fraction Conserved (0.05 to 0.95); Y-axis: Number of HSPs; series: HSPs with E<1 & Length>=35, and HSPs with E<1, FP & Length>=35.]
Generally, the structure of protein sequences is conserved if they have a sequence similarity of 35% or more; the selected region in the graph shows the useful homologs
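The selection criteria from the three charts can be combined into one filter. The sketch below assumes that "fraction conserved" means identical residues divided by query length, which the slides do not define; the `Hsp` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """One high-scoring segment pair; the field names are illustrative."""
    query_id: str
    subject_id: str
    evalue: float
    query_length: int
    identities: int
    has_fingerprint: bool = False

    @property
    def fraction_conserved(self):
        # Assumption: identical residues divided by the query length.
        return self.identities / self.query_length


def useful_homologs(hsps, max_evalue=1.0, min_length=35, min_fraction=0.35,
                    require_fingerprint=False):
    """Keep HSPs with E < 1, length >= 35 and at least 35% of the query conserved."""
    return [
        h for h in hsps
        if h.evalue < max_evalue
        and h.query_length >= min_length
        and h.fraction_conserved >= min_fraction
        and (h.has_fingerprint or not require_fingerprint)
    ]


hsps = [
    Hsp("CK00043.5prime", "AAD46849.2", 5e-90, 182, 181, True),
    Hsp("q2", "s2", 0.5, 30, 25),      # too short
    Hsp("q3", "s3", 0.2, 100, 20),     # only 20% conserved
    Hsp("q4", "s4", 3.0, 120, 100),    # E-value too large
    Hsp("q5", "s5", 1e-6, 80, 40),     # passes, but has no fingerprint
]
kept = useful_homologs(hsps)
kept_with_fp = useful_homologs(hsps, require_fingerprint=True)
```

The first record reuses the numbers from the example BLASTP report (182 letters, 181 identities); the others are invented to exercise each rejection rule.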
Bioinformatics Research at UNB
[Figure: research overview. Labels: TraceScan; EST sequences; Automated Data; Homologs, Motifs; Multi-Agent Genome Annotation; Automated Protein; Maintenance.]
Automated Protein Classification and Rule Maintenance
Use machine-learning techniques to find some rules
Apply the rules to classify uncharacterized sequences
[Figure: categorized sequences and their related data → rule construction process → a decision tree consisting of rules; uncharacterized sequences → rule application process → newly characterized sequences.]
Automated Protein Classification and
Rule Maintenance
Source data collection
Automated rule generation
Machine-learning algorithms and their comparison
Automated rule maintenance
Automated Rule Generation
C4.5 and CITree algorithms produce decision trees
WEKA (Waikato Environment for Knowledge Analysis) will be used for analyzing the dataset (http://www.cs.waikato.ac.nz/~ml/index.html)
Rule generation process (flowchart):
1. Start with sequences and their related data
2. Rule construction and decision tree creation
3. Until the end of the rules is reached: rule sieving. Is the rule qualified? If yes, update the rule database; if no, continue with the next rule
4. Apply rules to annotate target sequences from the target sequence database
5. Update the target sequence database
6. End
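As background for the tree-construction step: C4.5 selects the attribute to split on by its gain ratio, the information gain divided by the split information. The sketch below computes that quantity for one categorical attribute. It is a textbook illustration rather than code from WEKA, and the attribute values and class labels are invented.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` by the categorical attribute `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical data: does "has a DNA-binding motif" predict the protein class?
has_motif = ["yes", "yes", "no", "no"]
protein_class = ["factor", "factor", "other", "other"]
perfect = gain_ratio(has_motif, protein_class)
useless = gain_ratio(["a", "b", "a", "b"], protein_class)
```

An attribute that separates the classes perfectly gets a gain ratio of 1, while one that carries no information about the class gets 0.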

Comparison of Algorithms
Evaluation criteria for machine learning algorithms: accuracy and AUC (area under the ROC (receiver operating characteristic) curve)
Performance analysis
Tree Generated using Weka
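Both criteria can be computed directly from a classifier's output. The sketch below uses the pairwise definition of AUC, the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The labels and scores are made-up examples, not results from the study.

```python
def accuracy(labels, predictions):
    """Fraction of examples whose predicted class equals the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # threshold chosen arbitrarily
acc = accuracy(labels, predictions)
area = auc(labels, scores)
```

Accuracy depends on the chosen threshold, whereas AUC summarizes the ranking quality over all thresholds, which is why the two are usually reported together.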
Multi-agent Systems
A multi-agent system is one that consists of a number of agents, which interact with one another
In the most general case, agents will be acting on behalf of users with different goals and motivations
To successfully interact, they will require the ability to cooperate, coordinate, and negotiate with each other, much as people do
Multi-Agent System for Potato Genome Annotation
[Figure: system architecture. Components: local database (NRDB, MONTH, PRINTS, PROSITE); Automated Data Acquisition Pipeline; information agent; pipeline agent; classification module; web; rule construction agent; rule database; target sequence database; database update agent.]
Mapping Transcription Factors from a Model to a Non-Model Organism
Transcription Factor
• Group of proteins that initiate transcription
  – transcriptional activators
  – transcriptional repressors
• Consists of DNA-binding domains
• Binds to the binding site regions (specific DNA sequences)
• Controls the expression of the genes
• Human genome: 2600 proteins contain DNA-binding domains
Transcription Factor Mapping
[Figure: transcription factors A, B, C in the source genome are mapped to A1, B1, C1 in the target genome.]
Source Genome: Model Organism
• Investigated thoroughly by biologists
• Nodes: Transcription factors
Target Genome: Non-Model Organism
• Not much data available
• Nodes: Predicted transcription factors
Methodology
• BLASTP is used to map transcription factors from E. coli and Bacillus subtilis to the E. coli group and the Bacillus group
• Parameter: E-value threshold, 1e-5 to 10
• Not all transcription factors from one genome can be mapped to another genome
• The number of confirmed mappings between any two genomes depends on the definition of confirmed mapping used
• Compare the available transcription factors of the target genome to the predicted set of transcription factors
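The methodology above can be sketched as a threshold sweep over BLASTP hits. The sketch makes several assumptions that the slides leave open: a source transcription factor is mapped to its best hit under the threshold, and the prediction is scored by precision and recall against the known factors of the target genome. The hit list is invented.

```python
def map_transcription_factors(hits, threshold):
    """Map each source factor to its best target hit with E-value <= threshold.

    `hits` is an iterable of (source_tf, target_protein, evalue) tuples.
    """
    best = {}
    for source_tf, target, evalue in hits:
        if evalue > threshold:
            continue
        if source_tf not in best or evalue < best[source_tf][1]:
            best[source_tf] = (target, evalue)
    return {tf: target for tf, (target, _) in best.items()}


def evaluate(predicted, known):
    """Precision and recall of the predicted target factors against the known set."""
    predicted, known = set(predicted), set(known)
    true_positives = len(predicted & known)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Hypothetical BLASTP hits and known target-genome transcription factors.
hits = [
    ("A", "A1", 1e-30), ("A", "X9", 1e-3),
    ("B", "B1", 1e-4), ("C", "C1", 0.5), ("D", "Q7", 5.0),
]
known_target_tfs = {"A1", "B1", "C1"}
sweep = {
    t: evaluate(map_transcription_factors(hits, t).values(), known_target_tfs)
    for t in (1e-5, 0.1, 10.0)
}
```

In this toy data a strict threshold misses true factors, while a permissive one admits a false mapping, which is the trade-off behind the choice of E-value threshold.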
Summary of Mapping Results
• Transcription factor mapping in bacterial genomes
• Proposed method is able to map most of the transcription factors
• Transcription factor sequence motifs are well preserved
• 0.1 and 0.01: best E-value thresholds
• Correct choice of E-value threshold can be more important than selection of an evolutionarily closer model organism
Bioinformatics @ C-DAC
Dr. Rajendra Joshi

Group Coordinator: Bioinformatics
Scientific and Engineering Computing Group
Centre for Development of Advanced Computing
Pune - 411007
rajendra@cdac.in
http://bioinfo-portal.cdac.in
Bioinformatics Resources & Applications Facility (BRAF)
• Funded by the Department of Information Technology (DIT), Ministry of Communications and Information Technology
• Grid-enabling of numerous bioinformatics codes such as SW, BLAST, ClustalW, AMBER, CHARMM, etc.
• As part of BRAF, the team interacted with scientists from various CSIR labs, IITs and industries
BIOGENE: 1 TF machine
• AMD processors, 2.6 GHz (Total: 204 cores, 1060.8 GF)
• 4 SunX4600 nodes (8 sockets, dual core each) giving 64 cores
• 32 SunX2200 nodes (dual socket, dual core each) giving 128 cores
• Backup server: SunX2200 (4 cores)
• Storage server: two Sun X2200 (8 cores)
• Infiniband switch (Mellanox DDR2, 48 port)
• Storage: 20 Terabytes, RAID5
• Tape library with autoloader
• Benchmarking completed for AMBER, CHARMM, MEME, SW, Fasta, ClustalW, BLAST
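The core count and the peak figure quoted above can be checked with a few lines of arithmetic. The factor of 2 floating-point operations per cycle per core is an assumption; it is the value that reproduces the quoted 1060.8 GF and is not stated on the slide.

```python
# (description, number of machines, sockets per machine, cores per socket)
MACHINES = [
    ("SunX4600 compute nodes", 4, 8, 2),
    ("SunX2200 compute nodes", 32, 2, 2),
    ("SunX2200 backup server", 1, 2, 2),
    ("SunX2200 storage servers", 2, 2, 2),
]
CLOCK_GHZ = 2.6
FLOPS_PER_CYCLE = 2  # assumed; not given on the slide

total_cores = sum(count * sockets * cores for _, count, sockets, cores in MACHINES)
peak_gflops = total_cores * CLOCK_GHZ * FLOPS_PER_CYCLE

print(f"{total_cores} cores, peak {peak_gflops:.1f} GF")
```

The total of 204 cores includes the backup and storage servers, so the peak available to compute jobs alone (192 cores) would be somewhat lower under the same assumption.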
Using BRAF Facility
• Gipsy portal: open the URL http://gipsy.bioinfo-portal.cdac.in in a browser
• Command line login: ssh -p 30005 gateway.cdac.in
• Help on command line usage is available in the README file in the user's home directory
• Helpline: braf-help@cdac.in
Bioinformatics Application Software for High-End Clusters and Grid
• Anvaya: A Workflow Environment for High Throughput Comparative Genomics
• Taxo Grid: Phylogeny on Grid
• iMolDock: An Interface for Molecular Docking on HPC
• GENOPIPE: Automated Genome Annotation Pipeline on HPC
• GenomeGrid: Bioinformatics Problem Solving Environment on Grid
• GIPSY: Bioinformatics Problem Solving Environment on HPC
High-throughput Workflows for Genome Analysis
Collaboration: Biotechnology and Biological Sciences Research Council (UK)
• A Systems Biology based approach for annotation of Salmonella and Mycobacterium genomes
• Establishment of a common bioinformatics pipeline for analyses of bacterial genomes with emphasis on identification of virulence and pathogenic factors
Collaboration: Institute of Animal Health (UK)
• Genome Annotation: Salmonella
  – Causative agent of typhoid
  – Transmitted via food contamination
  – Causes economic losses as it affects livestock
• Annotation of 5 Salmonella genomes with a wide host range
[Figures: Food-borne disease cycle: Salmonella; Genome Annotation via GENOPIPE; Single nucleotide polymorphism]
Collaboration: University of Surrey (UK)
• Expert curation of the Mycobacterium leprae genome: causative agent of leprosy
• Development of a tool to calculate molecular weight of metabolites
Collaboration: Oregon Health & Science University (USA)
• Collaborative project initiated with OHSU in December 2009
• Provide computational support to the experimental studies at OHSU, through MD simulations on the BIOGENE cluster
• Propeptide domain of serine protease Furin acts as a pH sensor
• Phenomenon has been elucidated in silico through MD simulations
• Ten sets of simulations performed using NAMD
[Figure: Furin complex]
Collaboration: caBIG (NIH)
• The National Cancer Institute (NCI) is involved in deployment of an integrated biomedical informatics infrastructure, the cancer Biomedical Informatics Grid (caBIG™)
• A network that will freely connect the entire cancer community
• caBIG would set up a node at C-DAC
• GARUDA grid and BRAF resources may be used
Collaboration: IIT Madras
CGMD studies on GPCR
OA1 (GPR143) – a GPCR
• Belongs to Class I GPCR, Rhodopsin family
• 7TM receptors or heptahelical receptors
• An integral membrane glycoprotein of 404 aa
• Protein product of the ocular albinism type 1 gene
• Ocular albinism is an X-linked inherited disorder in which the eye lacks melanin pigment
• A homology based approach along with CGMD simulation has been planned for this work
Collaboration: Jubilant Biosys
• Simulate fragment binding sites by molecular dynamics simulation methods
• Identify the most probable site of interaction of chemical fragments in the protein
• 8 large simulations of 10 ns each were carried out
• Results handed over to Jubilant
Collaboration: Nicholas Piramal
• Contract research project
• To understand protein-ligand interactions using molecular dynamics simulations
• Involves carrying out molecular dynamics simulations on very large biomolecular systems
• Benefits in designing better molecules for known drug targets
• Four 20 ns molecular dynamics simulations have been carried out
Conclusion
• Biology – transforming from observational and physical experiments → computational science
• Bioinformatics – exciting research area
• Challenges – Biology and Computer Science – different ways of working and need for close collaboration
• Opportunities – new crops, personalized medicine, early diagnosis, …
Research Problems in Bioinformatics
• Find genomes of all organisms
• Identify and annotate all genes
• Compute sequence <-> 3D structure for all proteins
• Compare DNA / protein sequences for similarity
• Compare families of DNA / protein sequences
Reason to be optimistic: Biology is finite…
~30,000 human genes; ~1000 protein superfamilies
…but computer speeds keep increasing
Business Opportunities
• Clinical research
• Gene therapy
• Molecular science
• Pharmaceutical companies: automated technologies to manufacture effective therapies and drugs, due to increasing concerns about drug safety and the stringent regulations that govern clinical trials for drug discovery
• Bioinformatics platform market: growing at a very fast rate
• Global bioinformatics market: ~$8.3 billion by 2014
• Knowledge management (2009): $1.3 billion
• Bioinformatics platforms market (2014): ~$3.9 billion
Business Opportunities
Global bioinformatics market segments
- Bioinformatics platforms
  - Sequence alignment platforms
  - Sequence manipulation platforms
  - Sequence analysis platforms
  - Structural analysis platforms
- Content/Knowledge management tools
  - Specialized knowledge management tools
  - Generalized knowledge management tools
- Services
  - Data analysis
  - Sequencing services
  - Database & management services
- Applications
Thank You!
