Visiting Professor
Center for Development of Advanced Computing (C-DAC)
Pune, India
Outline
• Introduction – UNB, C-DAC, Bioinformatics
• Genome – Genes, Proteomes, Evolution
• Future
University of New Brunswick (UNB)
History
1987
PARAM 8000
Technology Denial
C-DAC Centres
• Headquarter
– Pune
• Centres
– Pune
– Knowledge Park, Bangalore
– Electronics City, Bangalore
– Chennai
– Delhi
– Hyderabad
– Kolkata
– Mohali
– Mumbai
– Noida
– Thiruvananthapuram
• Multilingual Computing
• Tools, Fonts, Products, Solutions, Research, Technology Development
• Software Technologies
• OSS, Multimedia, ICT for masses, E-Governance, Geomatics
• Professional Electronics
• Digital Broadband, Wireless Systems, Network Technologies, Power Electronics, Real-Time
Systems, Embedded Systems, VLSI/ASIC Design, Agri Electronics
• Health Informatics
• Hospital Information System, Telemedicine, Decision Support System
• Ubiquitous Computing
• RFID, Design, Development and Integration of Ubicomp System Components
File Servers
No. of Processors : 24 (UltraSparc-III@900MHz)
Aggregate Memory : 96 GigaBytes
Internal Storage : 0.4 TeraBytes
File System : QFS
Operating System : Solaris 8
Networks
Primary : PARAMNet-II @ 2.5 Gbps Full Duplex
Backup : Gigabit Ethernet @ 1 Gbps Full Duplex
Management : 10/100 Mbps Fast Ethernet
External Storage
Storage Array : 5 TeraBytes with 16 T3 disk arrays
Tape Library : 12 TeraBytes - L700 (5 LTO drives)
Software
HPCC - C-DAC’s High performance computing and communication software suite
Compilers, Parallel Libraries and Tools
Ranked 171 in 2nd quarter end and 258 as per the latest ranking
C-DAC
Advanced Computing Training School (ACTS)
ACTS @ a glance
International Presence
Azerbaijan Saudi Arabia Belarus
Russia
Tajikistan
Uzbekistan
Turkmenistan
Mauritius
Ghana
Armenia
Myanmar
Tanzania
Seychelles
Lesotho
Post Graduate Diploma Programs
M.Tech. Programs
Bioinformatics
Definitions
Bioinformatics
The creation and development of advanced
information and computational techniques for solving
problems in biology
High Performance Computing (HPC)
Hardware and software for high speed computations
and large storage
“Bio” Introduction
Molecular Biology
Living organisms (on Earth)
Lipids - Separate inside from outside
Proteins – Build 3D machinery to perform biological
functions
DNA – Store information on how to build machinery
Diagram of a cell
Lipid membranes - provide barrier
Protein structures - do work
DNA nucleus - store info
Molecular Biology
Deoxyribonucleic Acid (DNA)
Composition
- Sequence of nucleotides
- Nucleotide = deoxyribose sugar + phosphate group + base
Molecular Biology - DNA
DNA: contains genetic instructions used in the
development and functioning of all known living
organisms with the exception of some viruses.
DNA molecules: long-term storage of information.
DNA: a set of blueprints, like a recipe or a code, since it
contains the instructions needed to construct other
components of cells, such as proteins and RNA
molecules.
Genes: The DNA segments that contain instructions to
construct the above components of cells
Other DNA sequences: structural purposes, or are
involved in regulating the use of this genetic information.
Chemically, DNA consists of two long polymers of simple
units called nucleotides, with backbones made of sugars
and phosphate groups joined by ester bonds. These two
strands run in opposite directions to each other and are
therefore anti-parallel. Attached to each sugar is one of
four types of molecules called bases. It is the sequence
of these four bases along the backbone that encodes
information. This information is read using the genetic code.
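The anti-parallel arrangement described above can be illustrated with a reverse complement. This is a minimal sketch that assumes the standard base pairing (A with T, C with G), which the slide does not spell out; the function name is illustrative only.

```python
# Standard Watson-Crick pairing (assumed here): A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written in its own reading direction.

    Complementing gives the paired bases; reversing accounts for the
    two strands running in opposite directions (anti-parallel).
    """
    return strand.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
```

A sequence such as AAGCTT is its own reverse complement, which is why both strands read the same at that site.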
Molecular Biology
RNA: Ribonucleic acid
- A long chain of nucleotide units
- Each nucleotide consists of a nitrogenous base, a
ribose sugar, and a phosphate
RNA is very similar to DNA
RNA is usually single-stranded
DNA is usually double-stranded
RNA nucleotides contain ribose while DNA contains
deoxyribose (a type of ribose that lacks one oxygen
atom)
RNA has the base uracil rather than thymine that is
present in DNA
Molecular Biology
DNA: DNA → DNA (Replication)
RNA: DNA → RNA (Transcription / Gene Expression)
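As a small illustration of transcription, the sketch below converts a DNA sequence to the corresponding RNA sequence. It assumes the input is the coding (sense) strand, so the RNA has the same sequence with uracil in place of thymine; the function name is hypothetical.

```python
def transcribe(coding_strand: str) -> str:
    """DNA coding strand -> mRNA: same bases, with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


if __name__ == "__main__":
    print(transcribe("ATGGCGTAA"))
```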
Chromosome
DNA
DNA
Gene 1 Gene 2 . . . .
Raw Biological data Nucleic Acids (DNA)
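Codon table (rows grouped by first base; columns by second base T, C, A, G)
First base T
TTT Phe (F)   TCT Ser (S)   TAT Tyr (Y)   TGT Cys (C)
TTC Phe (F)   TCC Ser (S)   TAC Tyr (Y)   TGC Cys (C)
TTA Leu (L)   TCA Ser (S)   TAA Ter       TGA Ter
TTG Leu (L)   TCG Ser (S)   TAG Ter       TGG Trp (W)
First base C
CTT Leu (L)   CCT Pro (P)   CAT His (H)   CGT Arg (R)
CTC Leu (L)   CCC Pro (P)   CAC His (H)   CGC Arg (R)
CTA Leu (L)   CCA Pro (P)   CAA Gln (Q)   CGA Arg (R)
CTG Leu (L)   CCG Pro (P)   CAG Gln (Q)   CGG Arg (R)
First base A
ATT Ile (I)   ACT Thr (T)   AAT Asn (N)   AGT Ser (S)
ATC Ile (I)   ACC Thr (T)   AAC Asn (N)   AGC Ser (S)
ATA Ile (I)   ACA Thr (T)   AAA Lys (K)   AGA Arg (R)
ATG Met (M)   ACG Thr (T)   AAG Lys (K)   AGG Arg (R)
First base G
GTT Val (V)   GCT Ala (A)   GAT Asp (D)   GGT Gly (G)
GTC Val (V)   GCC Ala (A)   GAC Asp (D)   GGC Gly (G)
GTA Val (V)   GCA Ala (A)   GAA Glu (E)   GGA Gly (G)
GTG Val (V)   GCG Ala (A)   GAG Glu (E)   GGG Gly (G)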
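The codon table can be applied mechanically to translate a DNA coding sequence into a protein sequence. The sketch below builds the 64-codon table in the same T, C, A, G order as the slide; the function name, the choice to stop at the first Ter codon, and the use of "X" for unrecognised codons are assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids in TCAG codon order; "*" marks Ter (stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence, reading frame 0, until a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino = CODON_TABLE.get(dna[i:i + 3], "X")  # "X" for unknown codons
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTGGTAA"))
```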
Protein Structure
Protein 3D structure
http://anatomy.med.unsw.edu.au/cbl/research/cytoskeleton/swissprotactin.htm
“Informatics”
FASTA formatted Sequences
Biological databases
• Nucleotide Databases
Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome,
MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations,
IMGT
• Genome Databases
Human, Mouse, Yeast, C.elegans, FLYBASE, Parasites
• Protein Databases
Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis,
HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT
• Structure Databases
PDB, MSD, FSSP, DALI
• Microarray Database
ArrayExpress
• Literature Databases
MEDLINE, Software Biocatalog, Flybase Archives
• Alignment Databases
BAliBASE, Homstrad, FSSP
PDB – Protein Data Bank
Premise of Bioinformatics
Gene sequences determine biological function
Genomic DNA → Amino acids → Proteins → Function
Bioinformatics
Determining protein function
Hard way
- Biological / chemical analyses
- Determine 3D structure w/ x-ray crystallography, NMR
Easy way?
- Sequence protein / DNA → find close match in database
- Guess function based on match
- Validate guess in lab
Bioinformatics is imprecise
- Similar to data-mining
- Only suggests possible relationships
- Must validate correlation → causation
Growth of Bioinformatics
1970’s
- DNA sequencing
- Alignment w/ Smith-Waterman (dynamic programming)
1980’s
- Sequence databases (EMBL, GenBank)
- Alignment w/ FASTA (linked lists, hashing)
1990’s
- Automatic DNA sequencing
- Alignment w/ BLAST (neighborhood words, probabilities)
- Internet & WWW
Now
- Genomics, Proteomics
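The dynamic-programming idea behind Smith-Waterman local alignment can be sketched in a few lines. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for illustration, not values from the slides; real tools use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best local score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

The table has len(a) × len(b) cells, which is why heuristic tools such as FASTA and BLAST were introduced for database-scale searches.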
Bioinformatics Topics
Sequence alignments
- Find similarity between DNA / protein (amino acid) sequences
Genome assembly
- Combining genomic fragments to form whole genome
Gene identification & annotation
- Identify and classify genes on the genome
Microarrays & gene expression analysis
- Use DNA microarray (gene chip) to measure mRNA
Protein folding
- Compute 3-D protein structure ↔ protein sequence
Phylogenetic analysis
- Find genetic relationships between sequences / species
How do we do this?
Genome Organization
Chromosome
DNA
Leaf Tuber
Genome Organization
Proteins are building blocks for living organisms
Genes can also be identified through complementary DNA (cDNA); sequences of
the active, or expressed, genes are termed Expressed Sequence Tags (ESTs)
Chromosome
DNA
DNA
Gene 1 Gene 2 . . . .
Genome Organization
DNA
....TATACAGCAAAATAGAAAGATCTAGTGTCCCATGGCGATGAGTCGTGTAGCTTCT….
Leaf
Messages
Tuber
Messages
cDNA Collections (Libraries)
Database
Microarray Analysis
Microarray Analysis - Processing
Image Processing
Data Normalization
[Plot: Log(R/G) versus 0.5*(Log(G) + Log(R)), Slide3 and Slide70, R2 = 0.6185]
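The two plot axes can be computed directly from the red (R) and green (G) channel intensities of each spot. The sketch below assumes base-2 logarithms and background-corrected, strictly positive intensities; the slides do not state the log base or the normalization method, so the median centring shown is only one common choice.

```python
import math
import statistics


def ma_values(red, green):
    """Return a list of (M, A) pairs, one per spot.

    M = log2(R/G) is the log ratio; A = 0.5*(log2(G) + log2(R)) is the
    average log intensity. Base 2 is an assumption.
    """
    return [
        (math.log2(r / g), 0.5 * (math.log2(g) + math.log2(r)))
        for r, g in zip(red, green)
    ]


def median_center(m_values):
    """Shift log ratios so their median is 0 (assumed normalization)."""
    med = statistics.median(m_values)
    return [m - med for m in m_values]


if __name__ == "__main__":
    pairs = ma_values([8.0, 4.0, 2.0], [2.0, 4.0, 8.0])
    print(pairs)
    print(median_center([m for m, _ in pairs]))
```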
Analysis
- Differential Gene Expression
- Cluster Analysis
- Pathway Analysis
Signal
Background
Project
Database
Engine
Fighting Bird Flu Virus in 3-D
Advances in Microprocessor Technology
1974 - 1 MHz clock
1988 – 40 MHz
2002 – 2 GHz
2009 – P4 3.0 GHz, Quadcore 2.66 GHz
Intel Montecito chip
1.72 Billion transistors
NVidia 280 series GPU 1.4 Billion transistors
Current Supercomputer – Nov 2009
Jaguar
Future
• IBM Cyclops64 – supercomputer on a chip
• C-DAC initiative for 2010 – petaflop machine
• NCSA, USA 2011 – petaflop machine
• NASA, SGI and Intel Pleiades – 10 petaflops by 2012
• 1 exaflop (10**18 flops) by 2019
• Human brain neural simulations – 10 exaflops by 2025
• 2-week full weather modeling – 1 zettaflop (10**21 flops) by 2030
Hosting sites:
Member sites:
ACEnet
Bioinformatics Research
@
University of New Brunswick
The Canadian Potato Genome Project
Collaborators
Dr. Patricia Evans (UNB), Dr. Barry Flinn (BioAtlantech), Dr. David Dekoyer (Potato
Research Center), Carleton University, Nova Scotia Agricultural College
Potato
Integral part of diet – French fries,
mashed potatoes
DNA
Gene 1 Gene 2 . . .
Project Description
• Transgenics:
- Enhance tuber quality, processing traits, disease
resistance, stress tolerance more rapidly than breeding
• Expression Assisted Selection:
- Obtain expression profiles for thousands of genes
associated with specific traits or characteristics
- Use these profiles as a baseline to compare with
the expression profiles of unknown clones; crosses
• New Protein Products :
- Identify genes encoding secreted proteins/ligands
- Test these for growth-promoting/other effects
- Express genes in batch cultures and purify proteins
Example Of Gene Use
GA-20 oxidase in potato:
• GA-20 oxidase
knockouts with
enhanced tuber
production
• GA-20 oxidase
knockouts with
reduced tuber
sprouting
Bioinformatics: base-calling, clustering, BLAST, annotations,
and gene expression [UNB and PRC]
FASTA formatted EST sequence & trace files
EST sequence
>gi|39796586|gb|CK247430.1|CK247430 EST731067 potato callus cDNA library,
mRNA sequence
ACAAGTCACTATAGGGACATGCTTCAATTTTTTCAAAACATCTTGAATAGTACAAAGTGCACAACATACT
CCAAAAAACTGAATACATTTTCTATTGTCAATATCTATAGCCATATGACTTTCAGTGCGACCTATGCATT
CATAACTCCCGCTACCAAATCCACCATGTAGTGCTTACAACAACAAGCCTAGTGAGAACGTAAGCCTGGT
CTGGAGCCAAAAGCAAATTATGTATACTAAAAAACCCCCTGGCTAAAATGCATATCATGATTAGTAGTGA
CATT
Protein Sequence
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
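Records like the two above can be read with a few lines of code. This is a minimal sketch of a FASTA reader, assuming each record starts with a single ">" header line followed by one or more sequence lines; the function name is illustrative, and libraries such as BioPerl provide full parsers.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Header taken from the EST record above; only the first bases are shown.
    sample = (
        ">gi|39796586|gb|CK247430.1|CK247430 EST731067 potato callus cDNA library, mRNA sequence\n"
        "ACAAGTCACTATAGGG\n"
        "ACATGCTTCAATT\n"
    )
    for name, seq in parse_fasta(sample):
        print(name.split()[0], len(seq))
```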
Standard Genetic Code
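Codon table (rows grouped by first base; columns by second base T, C, A, G)
First base T
TTT Phe (F)   TCT Ser (S)   TAT Tyr (Y)   TGT Cys (C)
TTC Phe (F)   TCC Ser (S)   TAC Tyr (Y)   TGC Cys (C)
TTA Leu (L)   TCA Ser (S)   TAA Ter       TGA Ter
TTG Leu (L)   TCG Ser (S)   TAG Ter       TGG Trp (W)
First base C
CTT Leu (L)   CCT Pro (P)   CAT His (H)   CGT Arg (R)
CTC Leu (L)   CCC Pro (P)   CAC His (H)   CGC Arg (R)
CTA Leu (L)   CCA Pro (P)   CAA Gln (Q)   CGA Arg (R)
CTG Leu (L)   CCG Pro (P)   CAG Gln (Q)   CGG Arg (R)
First base A
ATT Ile (I)   ACT Thr (T)   AAT Asn (N)   AGT Ser (S)
ATC Ile (I)   ACC Thr (T)   AAC Asn (N)   AGC Ser (S)
ATA Ile (I)   ACA Thr (T)   AAA Lys (K)   AGA Arg (R)
ATG Met (M)   ACG Thr (T)   AAG Lys (K)   AGG Arg (R)
First base G
GTT Val (V)   GCT Ala (A)   GAT Asp (D)   GGT Gly (G)
GTC Val (V)   GCC Ala (A)   GAC Asp (D)   GGC Gly (G)
GTA Val (V)   GCA Ala (A)   GAA Glu (E)   GGA Gly (G)
GTG Val (V)   GCG Ala (A)   GAG Glu (E)   GGG Gly (G)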
[Diagram: bioinformatics research components, with EST sequences as input and a database]
- TraceScan
- Automated Data Acquisition Pipeline: homologs, motifs, fingerprints, transmembrane and signal sites
- Multi-Agent System for Potato Genome Annotation
- Automated Protein Classification and Rule Maintenance
TraceScan - Keywords
Example of a Chromatogram
The TraceScan Software System
Designed to investigate sequence quality, potential polymorphisms, and
base heterozygosity in EST sequences.
Relies on the combined analysis of a DNA sequence trace file, the trace
chromatogram, and multiple alignment of sequence homologs.
TraceScan
Base substitutions produce a new set of base quality scores for the
sequence.
ADAP features:
Uses comparative genomics to learn from the Homologs
New variant of BLAST, Parameter Regulated Iterative BLAST
(PRI-BLAST)
Uses 7 different analysis/search tools
A few software design patterns are used
Perl, MySQL, Perl-DBI, BioPerl, EMBOSS, BLASTP, SGE 5.3,
and Perl-Gtk on Linux
ADAP Overview
Input: FASTA formatted EST sequences
Phase 1: Hypothetical protein extraction and homolog generation
Phase 2: Sequence based protein structure prediction
Potato ADAP database (homologs and HPs), reached through the Perl-MySQL database interface
Legend: data flow, database interactions
PRI-BLAST
Rule module
Decides which set of BLASTP parameters to use
Halts the PRI-BLAST
Statistical module
Density of homologs is computed through SQL statements
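The control flow suggested by the rule module can be sketched as a loop over candidate parameter sets. Everything in this sketch is hypothetical: the function names, the parameter dictionaries, the stand-in search function and the halting rule (stop once a minimum number of homologs is found) are assumptions made for illustration, not the actual PRI-BLAST rules.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, float]  # hypothetical BLASTP parameter set


def pri_blast(query: str,
              parameter_sets: Sequence[Params],
              run_search: Callable[[str, Params], List[str]],
              min_homologs: int = 5) -> Tuple[Params, List[str]]:
    """Try parameter sets in order; halt when enough homologs are found.

    Returns the last parameter set tried and its hits. The halting rule
    is an assumed stand-in for the real rule module.
    """
    if not parameter_sets:
        raise ValueError("at least one parameter set is required")
    for params in parameter_sets:
        hits = run_search(query, params)
        if len(hits) >= min_homologs:
            break  # rule module halts the iteration
    return params, hits


def fake_search(query: str, params: Params) -> List[str]:
    """Stand-in for a BLASTP run: a looser E-value cutoff gives more hits."""
    return [f"hit{i}" for i in range(int(round(params["evalue"] * 10)))]


if __name__ == "__main__":
    sets = [{"evalue": 0.1}, {"evalue": 1.0}, {"evalue": 10.0}]
    chosen, hits = pri_blast("MKT", sets, fake_search)
    print(chosen, len(hits))
```

Passing the search function as an argument keeps the loop independent of how BLASTP is actually invoked.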
mPSA (Phases 2 & 3) for the ADAP contains 6 mPSA tools from
EMBOSS
[Chart: number of homologues (total) by length range of hypothetical protein, from 10 - 15 to 110 - 115; y-axis 0 to 10000, tallest bar 8768]
Shorter protein sequences have more homologs – they can be false positives
Homologues with E<1 and E<100 for Various Ranges of Lengths of Hyp. Proteins
[Chart: percentage of homologues with E<1, relative to the total number of homologues, by length of hypothetical protein (ranges 15 - 20 to 110 - 115); plotted values run from 41.5% to 100.0%]
Shorter sequences have a large E-value, hence we cannot use them in Comparative genomics
Fraction of Query Protein Conserved: Homologs (E<1 & Length>35) Vs Homologs (E<1,
Length>35, and have a Fingerprint)
[Chart: number of HSPs versus fraction conserved (0.05 to 0.95); legend: HSPs with E<1, FP & Length>=35]
Generally, the structure of the protein sequences is conserved if they have a sequence
similarity of 35% or more – selected region in the graph shows the useful homologs
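The selection criteria from the last three charts can be combined into one filter. This sketch assumes each hit is summarised by its E-value, the length of the hypothetical protein, the fraction of the query conserved and a fingerprint flag; the record layout and names are hypothetical, and the thresholds (E<1, length >= 35, 35% conserved) follow the slide text.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Hit:
    """Hypothetical summary of one homolog hit."""
    e_value: float
    query_length: int          # length of the hypothetical protein
    fraction_conserved: float  # between 0.0 and 1.0
    has_fingerprint: bool


def useful_homologs(hits: Iterable[Hit],
                    max_e: float = 1.0,
                    min_length: int = 35,
                    min_conserved: float = 0.35,
                    require_fingerprint: bool = True) -> List[Hit]:
    """Keep hits that pass every threshold."""
    return [
        h for h in hits
        if h.e_value < max_e
        and h.query_length >= min_length
        and h.fraction_conserved >= min_conserved
        and (h.has_fingerprint or not require_fingerprint)
    ]


if __name__ == "__main__":
    hits = [
        Hit(0.001, 80, 0.60, True),   # kept
        Hit(5.0, 80, 0.60, True),     # E-value too large
        Hit(0.001, 20, 0.90, True),   # query too short
        Hit(0.001, 80, 0.10, True),   # too little conserved
    ]
    print(len(useful_homologs(hits)))
```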
Bioinformatics Research at UNB
Uncharacterized sequences → Rule application process → Newly characterized sequences
Automated Protein Classification and
Rule Maintenance
[Flowchart elements: Start; Source data collection; End of rules? (Yes / No); Rule sieving; Is the rule qualified? (Yes / No); Update target sequence database; Target sequence database; End]
Performance analysis
Tree Generated using Weka
Multi-agent Systems
A multiagent system is one that consists of a number of agents, which
interact with one another
In the most general case, agents will be acting on behalf of users with
different goals and motivations
[Diagram: Automated Data Acquisition Pipeline; Information Agent; Pipeline Agent; Classification Module; Web; Rule Construction Agent; Rule Database; Target Sequence Database; Database Update Agent]
Mapping Transcription factors from a
Model to a non-Model Organism
Transcription Factor
Transcription Factor Mapping
A A1
B B1
C C1
Methodology
Bioinformatics @ C-DAC
• Helpline: braf-help@cdac.in
Bioinformatics Application Software
for High-End Clusters and Grid
Anvaya : A Workflow Environment for High Throughput Comparative Genomics
Collaboration:
Oregon Health & Science University (USA)
• Collaborative project initiated with OHSU in December 2009
• Provide computational support to the experimental studies at OHSU,
through MD simulations on BIOGENE cluster
• Propeptide domain of serine protease Furin acts as a pH sensor
• Phenomenon has been elucidated in-silico through MD simulations
• Ten sets of simulations performed using NAMD
Furin Complex
Collaborations: caBIG (NIH)
Business Opportunities
• Clinical research
• Gene therapy
• Molecular science
• Pharmaceutical companies - automated technologies to
manufacture effective therapies and drugs due to increasing
concerns about drug safety and the stringent regulations that
govern clinical trials for drug discovery.
• Bioinformatics platform market – growing at a very fast rate
• Global bioinformatics market: ~ $8.3 billion by 2014
• Knowledge management - 2009 - $1.3 billion
• Bioinformatics platforms market - 2014 - ~ $3.9 billion
Business Opportunities
- Bioinformatics platforms
- Sequence alignment platforms
- Sequence manipulation platforms
- Sequence analysis platforms
- Structural analysis platforms
- Content/Knowledge management tools
- Specialized knowledge management tools
- Generalized knowledge management tools
- Services
- Data Analysis
- Sequencing Services
- Database & Management services
- Applications
Thank You!