
Introduction to Bioinformatics

Mini Exam 3
Mini Exam

* Take a pencil and a piece of paper.
* Please do not sit too close to your neighbour.
* There are three questions. You have 15 minutes in total to write down short but clear answers.
* When you are ready, please submit your answers to the desk at the front.
Mini Exam 3: ANSWERS
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

* Chapter 4: The boulevard of broken dreams
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.1 The nose knows

* In 2004 Richard Axel and Linda Buck received the Nobel Prize for elucidating the olfactory system.

* Odorant receptors (ORs) sense certain molecules outside the cell and signal inside the cell.

* ORs contain 7 transmembrane domains.

* The OR family is the single largest gene family in the human genome, with about 1000 genes; the same holds for mouse, rat, and dog.

* In humans most OR genes have become pseudogenes: we lost much of our sense of smell as we came to rely on vision.


Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.2 Hidden Markov models

In 1989 Gary Churchill introduced the use of HMMs for DNA segmentation.

CENTRAL IDEAS:

* The string is generated by a system.
* The system can be in one of a number of distinct states.
* The system changes between states with transition probabilities T.
* In each state the system emits symbols to the string with emission probabilities E.
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.2 Hidden Markov models


[Figure: a three-state HMM. STATE 1 → STATE 2 → STATE 3, with transition probabilities T(1,2) and T(2,3); each state emits A, T, C, G with its own probabilities pA, pT, pC, pG.]

Example output (s = emitted symbol string, h = hidden state sequence):

s = TTCACTGTGAACGATCCGACCAGTACTACGACGTTGCCAAAGCGCTTAT
h = 11111111111111111111111122222222222223333333333333333333333333
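
These ingredients are enough to generate such strings in code. Below is a minimal sampling sketch (not from the lecture): it assumes the initial probabilities T(0, ·), the transition probabilities T and the emission probabilities E are stored in plain Python dictionaries named T0, T and E.

    import random

    def draw(dist):
        """Draw one key from a {key: probability} dictionary."""
        r, acc = random.random(), 0.0
        for k, p in dist.items():
            acc += p
            if r < acc:
                return k
        return k  # guard against floating-point rounding

    def sample_hmm(T0, T, E, n):
        """Generate a symbol string s and its hidden state path h of length n."""
        states = list(T0)
        h, s = [], []
        state = draw(T0)
        for _ in range(n):
            h.append(state)
            s.append(draw(E[state]))                          # emit with probability E
            state = draw({l: T[(state, l)] for l in states})  # move with probability T
        return "".join(s), h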
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

HMM essentials

TRANSITION MATRIX = the probability of a state change:

    T(k, l) = P(h_i = l | h_{i-1} = k)

EMISSION PROBABILITY = the symbol probability distribution in a certain state:

    E(k, b) = P(s_i = b | h_i = k)
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

HMM essentials

INITIAL PROBABILITY of a state:

    T(0, k) = P(h_1 = k)

Sequence of the states visited: h
Sequence of the generated symbols: s
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

HMM essentials

Probability of the hidden states h:

    P(h) = T(0, h_1) T(h_1, h_2) ··· T(h_{n-1}, h_n)

Probability of the generated symbol string s given the hidden states h:

    P(s | h) = E(h_1, s_1) E(h_2, s_2) ··· E(h_n, s_n)
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

HMM essentials

Joint probability of symbol string s and hidden states h:

    P(s, h) = P(s | h) P(h)
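
Scoring a given pair (s, h) is a direct transcription of these three formulas. A sketch using the same dictionary representation as the sampler above (for long sequences one would work with log probabilities to avoid underflow):

    def joint_prob(T0, T, E, s, h):
        """P(s, h) = P(s | h) * P(h) for symbol string s and state path h."""
        p = T0[h[0]] * E[h[0]][s[0]]                # T(0, h_1) * E(h_1, s_1)
        for i in range(1, len(s)):
            p *= T[(h[i-1], h[i])] * E[h[i]][s[i]]  # T(h_{i-1}, h_i) * E(h_i, s_i)
        return p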
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

HMM essentials

Theorem of total probability:

    P(s) = Σ_{h ∈ H^n} P(s, h) = Σ_{h ∈ H^n} P(s | h) P(h)

Most likely state sequence:

    h* = argmax_{h ∈ H^n} P(s, h)
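
For very short strings, P(s) can be checked directly against this definition by enumerating all |H|^n state paths, reusing joint_prob from the sketch above. This is exponential and for illustration only; the forward algorithm in Section 4.6 does the same job efficiently.

    from itertools import product

    def prob_s_bruteforce(T0, T, E, states, s):
        """P(s) as the sum of P(s, h) over every possible state path h."""
        return sum(joint_prob(T0, T, E, s, h)
                   for h in product(states, repeat=len(s)))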
EXAMPLE 4.2: Change points in Lambda phage

[Figure: a two-state HMM with states CG RICH and AT RICH. Each state stays in itself with probability 0.9998 and switches to the other state with probability 0.0002.]

Emission probabilities:

         CG RICH   AT RICH
    A:   0.2462    0.2700
    C:   0.2476    0.2084
    G:   0.2985    0.1981
    T:   0.2077    0.3236
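
In the dictionary representation used above, this model is just two small tables. The transition and emission values are the ones from the figure; the uniform initial distribution is an assumption, since the slide does not give one.

    states = ["CG", "AT"]                  # CG RICH and AT RICH
    T0 = {"CG": 0.5, "AT": 0.5}            # assumed uniform start (not on the slide)
    T = {("CG", "CG"): 0.9998, ("CG", "AT"): 0.0002,
         ("AT", "AT"): 0.9998, ("AT", "CG"): 0.0002}
    E = {"CG": {"A": 0.2462, "C": 0.2476, "G": 0.2985, "T": 0.2077},
         "AT": {"A": 0.2700, "C": 0.2084, "G": 0.1981, "T": 0.3236}}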
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.3 Profile hidden Markov models

* Characterize sets of homologous genes and proteins based on common patterns in their sequences.

* Classic approach: a multiple alignment of all elements in the family, summarized as a Position-Specific Scoring Matrix (PSSM).

* PSSMs cannot handle variable lengths or gaps.

* Profile HMMs (pHMMs) can do this.

Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.3 Profile hidden Markov models

* See Figure 4.4 for a pHMM for a multiple alignment of:

VIVALASVEGAS
VIVADA-VI--S
VIVADALL--AS

Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.3 Profile hidden Markov models

* A profile HMM (pHMM) summarizes the salient features of a protein alignment in one single model.

* A pHMM can also be used to produce multiple alignments.
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.4 Finding genes with hidden Markov models

* HMMs are better at detecting genes than sequence alignment.

* HMMs can detect introns and exons.

* Downside: HMMs are computationally much more demanding!
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.5 Case study: odorant receptors

* The 7-transmembrane (7-TM) G-protein coupled receptors

EXAMPLE 4.7: odorant receptors

[Figure: a two-state HMM with states IN and OUT (inside/outside the membrane). Self-transitions have probabilities P(IN-IN) and P(OUT-OUT); switches have probabilities P(IN-OUT) and P(OUT-IN). Each state emits amino acids with its own counts, e.g. A: 15, R: 11, ..., V: 31.]
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.6 Algorithms for HMM computations

The probability of the sequence under the given model is:

    P(s) = Σ_{h ∈ H^n} P(s, h) = Σ_{h ∈ H^n} P(s | h) P(h)

The most probable state sequence is:

    h* = argmax_{h ∈ H^n} P(s, h)
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

The VITERBI Dynamic Programming algorithm


Given a sequence s of length n and an HMM with parameters (T, E):

1. Create a table V of size |H| × (n+1);
2. Initialize (i = 0): V(0,0) = 1; V(k,0) = 0 for k > 0;
3. For i = 1:n, compute each entry using the recursive relation:
       V(j,i) = E(j, s(i)) · max_k { V(k,i-1) · T(k,j) }
       pointer(i,j) = argmax_k { V(k,i-1) · T(k,j) }
4. OUTPUT: P(s, h*) = max_k { V(k,n) };
5. Set h*_n = argmax_k { V(k,n) };
6. Trace back for i = n:2: h*_{i-1} = pointer(i, h*_i);
7. OUTPUT: the most probable state path h*.
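
A direct transcription of these steps into Python, again using the dictionary representation from the Lambda-phage example. This is a sketch: a production version would work with log probabilities to avoid underflow on long sequences.

    def viterbi(T0, T, E, states, s):
        """Return the most probable state path h* for s, and P(s, h*)."""
        n = len(s)
        # V[i][k]: probability of the best path emitting s[0..i], ending in state k
        V = [{k: T0[k] * E[k][s[0]] for k in states}]
        ptr = [{}]               # ptr[i][j]: best predecessor of state j at position i
        for i in range(1, n):
            V.append({})
            ptr.append({})
            for j in states:
                best = max(states, key=lambda k: V[i-1][k] * T[(k, j)])
                ptr[i][j] = best
                V[i][j] = E[j][s[i]] * V[i-1][best] * T[(best, j)]
        last = max(states, key=lambda k: V[n-1][k])   # step 5: h*_n
        h = [last]
        for i in range(n - 1, 0, -1):                 # step 6: trace back
            h.append(ptr[i][h[-1]])
        h.reverse()
        return h, V[n-1][last]

With the Lambda-phage tables above, viterbi(T0, T, E, states, seq) segments a sequence seq into CG-rich and AT-rich stretches.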
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

The FORWARD algorithm


Given a sequence s of length n and an HMM with parameters (T, E):

1. Create a table F of size |H| × (n+1);
2. Initialize (i = 0): F(0,0) = 1; F(k,0) = 0 for k > 0;
3. For i = 1:n, compute each entry using the recursive relation:
       F(j,i) = E(j, s(i)) · Σ_k { F(k,i-1) · T(k,j) }
   (no pointers are needed: unlike Viterbi, the forward algorithm sums over all paths);
4. OUTPUT: P(s) = Σ_k { F(k,n) }.
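
The forward algorithm differs from Viterbi only in replacing max by a sum, which the code makes plain (same sketch conventions as above):

    def forward(T0, T, E, states, s):
        """Total probability P(s), summed over all state paths."""
        # F[k]: probability of emitting s[0..i] and being in state k at position i
        F = {k: T0[k] * E[k][s[0]] for k in states}
        for c in s[1:]:
            F = {j: E[j][c] * sum(F[k] * T[(k, j)] for k in states)
                 for j in states}
        return sum(F.values())

On short strings its result should match prob_s_bruteforce above, which makes a convenient sanity check.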
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

The EM (Expectation Maximization) algorithm


Given a sequence s and an HMM with unknown (T, E):

1. Initialize h, E and T;
2. Given s and h, estimate E and T simply by counting emissions and state transitions;
3. Given s, E and T, estimate h, e.g. with the Viterbi algorithm;
4. Repeat steps 2 and 3 until some convergence criterion is met.
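
When step 3 uses Viterbi, this alternation is often called Viterbi training. A minimal sketch, reusing the viterbi function above; the pseudocounts of 1 and the fixed iteration count are assumptions made for the illustration, not part of the slide:

    from collections import Counter

    def viterbi_training(T0, states, alphabet, s, h, iters=10):
        """Alternate counting (step 2) and Viterbi decoding (step 3)."""
        for _ in range(iters):
            # Step 2: re-estimate E and T by counting, with pseudocount 1
            em = {k: Counter({b: 1 for b in alphabet}) for k in states}
            tr = Counter({(k, l): 1 for k in states for l in states})
            for i, (state, sym) in enumerate(zip(h, s)):
                em[state][sym] += 1
                if i > 0:
                    tr[(h[i-1], state)] += 1
            E = {k: {b: c / sum(em[k].values()) for b, c in em[k].items()}
                 for k in states}
            T = {(k, l): tr[(k, l)] / sum(tr[(k, m)] for m in states)
                 for k in states for l in states}
            # Step 3: re-estimate the hidden path h with Viterbi
            h, _ = viterbi(T0, T, E, states, s)
        return T, E, h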
EXAMPLE: finding genes with VEIL

The Viterbi Exon-Intron Locator (VEIL) was developed by John Henderson, Steven Salzberg, and Ken Fasman at Johns Hopkins University.

A gene finder with a modular structure:

* Uses an HMM made up of sub-HMMs, each describing a different part of the sequence: upstream noncoding DNA, exon, intron, ...
* Assumes the test data start and end with noncoding DNA and contain exactly one gene.
* Uses biological knowledge to hardwire parts of the HMM, e.g. start and stop codons, splice sites.
The exon sub-model

[Figure: the VEIL exon sub-HMM.]
Other sub-models

The start codon model is very simple:

    [Figure: Upstream → a → t → g → Exon]

The splice junctions are also quite simple and can be hardwired (the figure shows the 5′ splice site).
The overall model

[Figure: the complete VEIL model. Upstream → start codon → Exon → stop codon → Downstream → polyA site, with a loop Exon → 5′ splice site → intron → 3′ splice site → Exon.]

For more details, see J. Henderson, S.L. Salzberg, and K. Fasman (1997), Journal of Computational Biology 4(2), 127-141.
END of LECTURE 4
