You are on page 1of 34

Bioinformatics Tools

Byoung-Tak Zhang and Chul Joo Kang


School of Computer Science and Engineering
Seoul National University

Contents
1. Sequence Alignment
2. Multiple Sequence Alignment
3. Pattern Finding
4. Structure Prediction
5. DNA Microarray
6. Major Tools for Proteomics

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

Bioinformatics Tools
Problems

Tools or Database

Sequence Alignment

BLAST, FASTA

Multiple Sequence
Alignment

Clustal W, Macaw

Pattern Finding

GRAIL, FGENEH, tRNAscan-SE, NNPP,


eMOTIF, PROSITE, ChloroP

Structure Prediction

Bend.it, RNA Draw, NNPREDICT, SWISSMODEL

DNA Microarray

GeneX, GOE, MAT, GeNet

Hardwares for
Proteomics

2D Gel, MALDI-TOF
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

Bioinformatics Based on Sequence


!

Genome sequencing project: get huge sequence


data

Rretrieving useful information from sequence data


- Find genes and other elements
- Classification
- Predict its function
- Others

Read the articles on Nature and Science


(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

1. Sequence Alignment

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

Pairwise Sequence Alignment


!
!
!
!

!
!
!

Basic job handling sequences


Alignment between two sequences
Sequence database search
Finding similar DNA or protein sequences

Smith-Waterman algorithm
BLAST
FASTA
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

Smith-Waterman Algorithm (1)


!
!

Finding local alignments


Using dynamic programming

TCAT*G
*CATTG

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

Smith-Waterman Algorithm (2)


!
!
!
!

Best results and slow performance


Can grasp results missed by BLAST or FASTA.
Available on web:
spiral.genes.nig.ac.jp/homolgy/ssearch-e.shtml
Ref : Smith, T. F. and Watermann, M. S. (1981)
Identification of common molecular subsequence.
Journal of Molecular Biology, 147: 196-197.

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

BLAST & FASTA


!
!
!
!

Using heuristic algorithms


Word based match
Faster than Smith &Waterman
BLAST: www.ncbi.nlm.nih.gov/BLAST
Ref: Altschul, S. F., et al. (1990) Basic local alignment
search tool. Journal of Molecular Biology, 215, 403410.
FASTA: www.ebi.ac.uk/fasta3
Ref: Wilbur, W. J. and Lipman, D. J. (1983) Rapid
similarity searches of nucleic acid and protein data
banks. Proceedings of the National Academy of
Sciences USA 80, 726-30.
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

The BLAST Search


For the query, find the list
of high scoring words of
length W.
Compares the word list to
the database and identifies
exact matches.
For each word match,
extends the alignment in
both directions to find
alignments that score
greater than a threshold of
value S.

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

10

BLAST Result
Program: BLASTP
Query: human MYB
binding protein
Database: Swissprot

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

11

2. Multiple Sequence Alignment

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

12

Multiple Sequence Alignment


!
!

Align multiple sequences


Finding well conserved regions
4Finding motifs or other elements

!
!
!

Building gene families


Constructing a phylogenetic tree
Using whole dynamic programming ?
4O(n^k) problem (n = sequence length, k = number of
sequences)

!
!

Clustal W
Macaw
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

13

Clustal W
!
!
!
!

Step 1: Pairwise alignment between two sequences


Step 2: Making sequence weights and clusters
Step 3: Alignment between most similar sequences or
clusters
Step 4: Making one optimal output sequence
Ref: Higgins, D. G. and Sharp, P. H. (1988) CLUSTAL: A
package for performing multiple sequence alignment
on a microcomputer. Gene 73, 237-244.

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

14

Multiple Alignment by Clustal W

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

15

Phylogenetic Tree Construction Based


on Clustal W Results

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

16

3. Pattern Finding

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

17

Pattern Finding
! DNA

sequences have all information about


their products
4Gene finding
4Other sequence element finding
4Motif finding
4Localization prediction
4And others

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

18

Gene Structure

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

19

Gene Finding
! From

an unannotated DNA sequence, find


putative expressive regions
! Distinguish exon and intron regions (in
eukaryote).
! Solutions
4Compare known genes (alignments)
4Simply find conserved region
4Using statistical models like hidden Markov
models and others
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

20

GRAIL: Find Exon and Intron


!
!
!
!
!

!
!

In eukaryotes, RNA is processed.


Intron: splice out
Exon: join together
GU AG rule
GRAIL: exon prediction from a
genomic sequence with
RepeatMasker filtering
Neural networks which combine a
series of coding prediction
algorithms
http://compbio.ornl.gov/Grail-1.3
Ref: Einstein et al. (1992) Computerbased construction of gene using
the GRAIL gene assembly program.
Oak Ridge Natl. Lab. Report TM12174.

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

21

GRAIL Results

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

22

FGENEH: Prediction of Multiple Genes


in Human DNA Sequences
!
!
!

Finding genes from DNA sequences


HMM based human gene structure prediction
Available on web:
http://searchlauncher.bcm.tmc.edu/seq-search/genesearch.html
Ref: Algorithm is trained on our data and based on HMM
similar with Genescan (Burge, C. and Karlin, S. (J.
Mol. Biol. 1997, 268, 78-94.) and Genie (Kulp, D.,
Haussler, D., Reese, M.G., and Eeckman, F. H. (1996),
Proc. Conf. on Intelligent Systems in Molecular
Biology '96, 134-142).
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

23

FGENEH: Prediction of Multiple Genes


in Human DNA Sequences

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

24

tRNAscan-SE: Find tRNA Genes


!
!

!
!
!

Finding tRNA from genomic sequence


Finding polII promoter sites, searching conserved
secondary structures - easier problem than finding protein
coding genes
Combining several earlier programs and algorithms
Http://www.genetics.wustl.edu/eddy/tRNAscan-SE/
Ref: Lowe, T. M. and Eddy. S. R. (1997) tRNAscan-SE:
A program for improved detection of transfer RNA
genes in genomic sequence. Nucl. Acids Res. 25: 955964.

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

25

tRNAscan-SE Results

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

26

Sequence Element Finding

* Promoter region of E.coli lacZ operon

Promoter and other elements: regulatory and other protein binding


sites
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

27

NNPP: Neural Network Promoter


Prediction
!
!
!

Finding promoter sequences on DNA sequences


Time-delay neural networks
Two feature layers
4Recognizing TATA boxes
4Recognizing initiators

http://www.fruitfly.org/seq_tools/promoter.html

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

28

Neural Network Promoter Prediction

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

29

Motif Finding
!
!

Find conserved residues from DNA or protein sequences.


Commonly, conserved residues are related to protein functions.

Solutions:
Finding consensus sequences
Alignment
Position specific weight matrix
Hidden Markov models
Motif of one class of zinc finger protein
(C2H2)
Cx{2,4}Cx{12}Hx{3,5}H
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

30

Motif Databases: eMOTIF, PROSITE


!

Prosite, eMotif: protein motif databases

eMotif (motif.stanford.edu/emotif)
eMotif Maker: building new motif from multiple alignment
eMotif scan: find motif pattern
eMotif search: find motif from protein sequence

Ref: Huang, J. Y. and Brutlag, D. L. (2001) The eMOTIF Database.


Nucleic Acids Research, 29(1), 202-204
Craig et al. (1998) Highly specific protein sequence motif for
genomic analysis. Proc. Natl. Acad. Sci. 95: 5865-5871

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

31

eMOTIF (1)

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

32

eMOTIF (2)

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

33

eMOTIF (3)

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

34

PROSITE (1)
!
!

PROSITE: Database of protein families and


domains (http://www.expasy.ch/prosite/)
Consists of biologically significant sites, patterns
and profiles that help to reliably identify to which
known protein family (if any) a new sequence
belongs.

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

35

PROSITE (2)
e.g.) C2H2 zinc finger
protein

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

36

ScanPROSITE
!

Compares query sequences


(protein) to Prosite.
! Find patterns

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

37

Others: Find Gene Destination


!
!

After translation, proteins are located to proper places


(nuclear, mitochondria, outside of cell)
That process determined by amino acid sequences on
proteins
4 Finding signal peptide
4 Predicting protein targets

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

38

ChloroP: Prediction of Chloroplast


Importing Protein (1)
!
!
!
!

Predicting chloroplast targeting transit peptides


Predicting cleavage sites
Neural network based method
Ref: Emanuelsson, O., Nilesen, H., and von
Heijne, G. (1999) A neural network-based
method for predicting chloroplast transit
peptides and their cleavage sites. Protein
Science 8, 978-984.
Service site:
http://www.cbs.dtu.dk/services/ChloroP/
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

39

ChloroP: Prediction of Chloroplast


Importing Protein (2)

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

40

4. Structure Prediction

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

41

Structure Prediction
!
!
!

DNA, RNA and protein structure prediction


These structures affect their function, stability, etc.
Protein structure prediction by calculation of
biochemical properties of each amino acid
4In most cases of protein secondary or tertiary structure
prediction, it is a terribly huge computing job (almost
impossible)

Prediction based on known structures

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

42

DNA & RNA Structure


!
!

DNA & RNA strands: dynamic one, not linear form


DNA: regionally denatured (breathing), bending
4 Affects its expressions and others.

RNA: forming intra strand pairing


4 Affects its stability, function (RNase), and others.

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

43

DNA Bending
!
!
!

DNA structure is flexible.


DNA double strand is
bended by protein binding.
The bending of the DNA
structure will influence
other proteins to bind to it.

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

44

Bend.it: Predict DNA Bendability


!
!
!
!

Predicting DNA curvatures from DNA sequences


The curvature is calculated as a vector sum of
dinucleotide geometries (roll, tilt and twist angles).
Expressed as degrees per helical turn.
Ref: Munteanu, M. G., et al. (1998) Rod models of
DNA: sequence-dependent anisotropic elastic
modelling of local bending phenomena.
Trends Biochem. Sci. 23(9), 341-346
Service site:
http://www2.icgeb.trieste.it/~dna/bend_it.html
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

45

RNA Draw: Predict RNA Secondary


Structure on PC (1)
!
!

Predicting RNA secondary conformation under given E


state
Using dynamic programming, base pair probability and E
parameters

Clover leaf structure of tRNA and its 3D structure


(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

46

RNA Draw: Predict RNA Secondary


Structure on PC (2)
E.coli tRNA-Ile
predicted
structure on 37 C

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

47

Protein Structure

Primary structure (amino acid


sequence) determine its
secondary structure (partial
structure: helix, -sheet) and
tertiary structure

Protein function: determined by its structure


(e.g.: enzyme inactivated by heat - because its structure
was changed by heat.)
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

48

Secondary Structure Prediction


!

Prediction of secondary
structure
4 -helix
4 -sheet

!
!

Can be helpful for its 3D


conformation study.
Solutions
4 Calculate propensity for each
amino acid to be in helix, sheet
4 Calculate information content
4 Neural network based method
4 Nearest neighbor
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

49

NNPREDICT: Protein Secondary


Structure Prediction
!
!

Two-layer, feed-forward neural network


Finding secondary structure from amino acid sequences

http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

50

Tertiary Structure Prediction

!
!

Using E state?: only select lowest E state - impossible


Using homology modeling: similar sequence = similar
structure
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

51

SWISS-MODEL
!

Predicting protein 3D structure

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

52

Swiss-Pdb Viewer

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

53

SWISS-MODEL
!

Ref: Guex, N. and Peitsch, M. C. (1997) SWISSMODEL and the Swiss-PdbViewer: An


environment for comparative protein modelling.
Electrophoresis 18:2714-2723.
Guex, N. and Peitsch, M. C. (1999) Molecular
modelling of proteins. Immunology News 6:132-134.
Available on web:
http://www.expasy.ch/swissmod/SWISS-MODEL.html

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

54

5. DNA Microarray

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

55

DNA Microarray (1)

!
!
!

Monitoring of gene expression


cDNA chip: spotting cDNAs
Oligonucleotide chip: spotting
short DNA sequences

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

56

DNA Microarray: cDNA chip (2)


!
!

DNAs are spotted on glass surfaces.


Applications: drug effects, drug metabolism, disease diagnosis, finding
gene pathway, finding disease genes, etc.

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

57

DNA Microarray: oligonucleotide chip (3)


!
!

Short (10~25 nt) oligonucleotides on chip


Applications: find sNPs, detect mutation, sequencing by chip,
genotyping

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

58

DNA Microarray (4)


!

Processing DNA chip images to obtain qualified data

Data mining from DNA chip images like clustering (hierarchical


algorithms, partitioning algorithms)

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

59

DNA Microarray (4)


!

Ref: Schena, M., et al. (1996) Parallel human


genome analysis: Microarray-based expression
monitoring of 1000 genes. Proc Natl. Acad. Sci.
USA. 93(20):10614-9.
Useful websites:
4www.affimatrix.com
4www.genechip.co.kr
4inkage.rockefeller.edu/wli/microarray
4cmgm.stanford.edu/~plf58/microArray

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

60

6. Major Tools for Proteomics

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

61

Major Tools for Proteomics: 2D Gel &


MALDI-TOF
!

Proteomics:
4 Protein expression profile
4 Protein interaction
4 Protein structure
4 Protein variation

Why proteomics?
4 Post-Genomics Era: Understanding functions
4 Applications in target discovery
4 Novel proteins and genes
4 Disease associated proteins
4 Elucidate pathways and processes in disease/cell biology
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

62

2-D Gel Electrophoresis


!
!
!

Separates proteins by their netcharge and


molecular weights.
Displays whole proteins
Like DNA chip, compares expression under
different treatments, and detects tissues or cell
specific expressions.
4Finding specific proteins for further study

Poor reproduction problem

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

63

2-D Gel Image of Human Liver

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

64

2-D Electrophoretic Gel Images

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

65

MALDI-TOF: Matrix-Assisted Laser


Desorption Ionization
!
!

Detection of protein molecular weights (mass


spectrophotometer)
Can determine protein sequence.

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

66

Other bioinformatic work


!

Sequence assembly
4 From scattered DNA sequences, construct full sequence

Specific database warehousing


4 Structural databases, protein block databases, drug databases

sNPs
4 Find single nucleotide polymorphisms in population

Construct gene to gene, protein to protein and gene to


protein interaction pathway

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

67

You might also like