
Introduction to Bioinformatics

Mini Exam 3
Mini Exam

* Take a pencil and a piece of paper.
* Please do not sit too close to your neighbour.
* There are three questions. You have 15 minutes in total to write down short but clear answers.
* When you are ready, please submit your answers to the desk at the front.
Mini Exam 3: ANSWERS
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

* Chapter 4: The boulevard of broken dreams
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.1 The nose knows

* In 2004 Richard Axel and Linda Buck received the Nobel Prize for elucidating the olfactory system.

* Odorant receptors (ORs) sense certain molecules outside the cell and signal inside the cell.

* ORs contain 7 transmembrane domains.

* The OR family is the single largest gene family in the human genome, with about 1000 genes; the same holds for mouse, rat, and dog.

* In humans most OR genes have become pseudogenes: we lost much of our sense of smell as we came to rely on vision.


Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.2 Hidden Markov models

In 1989 Gary Churchill introduced the use of HMMs for DNA segmentation.

CENTRAL IDEAS:

* The string is generated by a system.
* The system can be in one of a number of distinct states.
* The system changes between states with transition probabilities T.
* In each state the system emits symbols to the string with emission probabilities E.
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.2 Hidden Markov models


[Figure: a three-state HMM. STATE 1 → STATE 2 → STATE 3, with transition probabilities T(1,2) and T(2,3); each state emits A, T, C, G with its own probabilities pA, pT, pC, pG.]

Example output (s = emitted symbol string, h = hidden state sequence):

s = TTCACTGTGAACGATCCGACCAGTACTACGACGTTGCCAAAGCGCTTAT
h = 11111111111111111111111122222222222223333333333333333333333333
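
These ingredients are enough to generate such strings in code. Below is a minimal sampling sketch (not from the lecture): it assumes the initial probabilities T(0, ·), the transition probabilities T and the emission probabilities E are stored in plain Python dictionaries named T0, T and E.

    import random

    def draw(dist):
        """Draw one key from a {key: probability} dictionary."""
        r, acc = random.random(), 0.0
        for k, p in dist.items():
            acc += p
            if r < acc:
                return k
        return k  # guard against floating-point rounding

    def sample_hmm(T0, T, E, n):
        """Generate a symbol string s and its hidden state path h of length n."""
        states = list(T0)
        h, s = [], []
        state = draw(T0)
        for _ in range(n):
            h.append(state)
            s.append(draw(E[state]))                          # emit with probability E
            state = draw({l: T[(state, l)] for l in states})  # move with probability T
        return "".join(s), h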
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

HMM essentials

TRANSITION MATRIX = the probability of a state change:

    T(k, l) = P(h_i = l | h_{i-1} = k)

EMISSION PROBABILITY = the symbol probability distribution in a certain state:

    E(k, b) = P(s_i = b | h_i = k)
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

HMM essentials

INITIAL PROBABILITY of a state:

    T(0, k) = P(h_1 = k)

Sequence of the states visited: h
Sequence of the generated symbols: s
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

HMM essentials

Probability of the hidden states h:

    P(h) = T(0, h_1) T(h_1, h_2) ··· T(h_{n-1}, h_n)

Probability of the generated symbol string s given the hidden states h:

    P(s | h) = E(h_1, s_1) E(h_2, s_2) ··· E(h_n, s_n)
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

HMM essentials

Joint probability of symbol string s and hidden states h:

    P(s, h) = P(s | h) P(h)
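
Scoring a given pair (s, h) is a direct transcription of these three formulas. A sketch using the same dictionary representation as the sampler above (for long sequences one would work with log probabilities to avoid underflow):

    def joint_prob(T0, T, E, s, h):
        """P(s, h) = P(s | h) * P(h) for symbol string s and state path h."""
        p = T0[h[0]] * E[h[0]][s[0]]                # T(0, h_1) * E(h_1, s_1)
        for i in range(1, len(s)):
            p *= T[(h[i-1], h[i])] * E[h[i]][s[i]]  # T(h_{i-1}, h_i) * E(h_i, s_i)
        return p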
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

HMM essentials

Theorem of total probability:

    P(s) = Σ_{h ∈ H^n} P(s, h) = Σ_{h ∈ H^n} P(s | h) P(h)

Most likely state sequence:

    h* = argmax_{h ∈ H^n} P(s, h)
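
For very short strings, P(s) can be checked directly against this definition by enumerating all |H|^n state paths, reusing joint_prob from the sketch above. This is exponential and for illustration only; the forward algorithm in Section 4.6 does the same job efficiently.

    from itertools import product

    def prob_s_bruteforce(T0, T, E, states, s):
        """P(s) as the sum of P(s, h) over every possible state path h."""
        return sum(joint_prob(T0, T, E, s, h)
                   for h in product(states, repeat=len(s)))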
EXAMPLE 4.2: Change points in Lambda phage

[Figure: a two-state HMM with states CG RICH and AT RICH. Each state stays in itself with probability 0.9998 and switches to the other state with probability 0.0002.]

Emission probabilities:

         CG RICH   AT RICH
    A:   0.2462    0.2700
    C:   0.2476    0.2084
    G:   0.2985    0.1981
    T:   0.2077    0.3236
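
In the dictionary representation used above, this model is just two small tables. The transition and emission values are the ones from the figure; the uniform initial distribution is an assumption, since the slide does not give one.

    states = ["CG", "AT"]                  # CG RICH and AT RICH
    T0 = {"CG": 0.5, "AT": 0.5}            # assumed uniform start (not on the slide)
    T = {("CG", "CG"): 0.9998, ("CG", "AT"): 0.0002,
         ("AT", "AT"): 0.9998, ("AT", "CG"): 0.0002}
    E = {"CG": {"A": 0.2462, "C": 0.2476, "G": 0.2985, "T": 0.2077},
         "AT": {"A": 0.2700, "C": 0.2084, "G": 0.1981, "T": 0.3236}}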
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.3 Profile hidden Markov models

* Characterize sets of homologous genes and proteins based on common patterns in their sequences.

* Classic approach: a multiple alignment of all elements in the family, summarized as a Position-Specific Scoring Matrix (PSSM).

* PSSMs cannot handle variable lengths or gaps.

* Profile HMMs (pHMMs) can do this.

Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.3 Profile hidden Markov models

* See Figure 4.4 for a pHMM for a multiple alignment of:

VIVALASVEGAS
VIVADA-VI--S
VIVADALL--AS

Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.3 Profile hidden Markov models

* A profile HMM (pHMM) summarizes the salient features of a protein alignment in one single model.

* A pHMM can also be used to produce multiple alignments.
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.4 Finding genes with hidden Markov models

* HMMs are better at detecting genes than sequence alignment.

* HMMs can detect introns and exons.

* Downside: HMMs are computationally much more demanding!
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.5 Case study: odorant receptors

* The 7-transmembrane (7-TM) G-protein coupled receptors

EXAMPLE 4.7: odorant receptors

[Figure: a two-state HMM with states IN and OUT (inside/outside the membrane). Self-transitions have probabilities P(IN-IN) and P(OUT-OUT); switches have probabilities P(IN-OUT) and P(OUT-IN). Each state emits amino acids with its own counts, e.g. A: 15, R: 11, ..., V: 31.]
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

4.6 Algorithms for HMM computations

The probability of the sequence under the given model is:

    P(s) = Σ_{h ∈ H^n} P(s, h) = Σ_{h ∈ H^n} P(s | h) P(h)

The most probable state sequence is:

    h* = argmax_{h ∈ H^n} P(s, h)
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

The VITERBI Dynamic Programming algorithm


Given a sequence s of length n and an HMM with parameters (T, E):

1. Create a table V of size |H| × (n+1);
2. Initialize (i = 0): V(0,0) = 1; V(k,0) = 0 for k > 0;
3. For i = 1:n, compute each entry using the recursive relation:
       V(j,i) = E(j, s(i)) · max_k { V(k,i-1) · T(k,j) }
       pointer(i,j) = argmax_k { V(k,i-1) · T(k,j) }
4. OUTPUT: P(s, h*) = max_k { V(k,n) };
5. Set h*_n = argmax_k { V(k,n) };
6. Trace back for i = n:2: h*_{i-1} = pointer(i, h*_i);
7. OUTPUT: the most probable state path h*.
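
A direct transcription of these steps into Python, again using the dictionary representation from the Lambda-phage example. This is a sketch: a production version would work with log probabilities to avoid underflow on long sequences.

    def viterbi(T0, T, E, states, s):
        """Return the most probable state path h* for s, and P(s, h*)."""
        n = len(s)
        # V[i][k]: probability of the best path emitting s[0..i], ending in state k
        V = [{k: T0[k] * E[k][s[0]] for k in states}]
        ptr = [{}]               # ptr[i][j]: best predecessor of state j at position i
        for i in range(1, n):
            V.append({})
            ptr.append({})
            for j in states:
                best = max(states, key=lambda k: V[i-1][k] * T[(k, j)])
                ptr[i][j] = best
                V[i][j] = E[j][s[i]] * V[i-1][best] * T[(best, j)]
        last = max(states, key=lambda k: V[n-1][k])   # step 5: h*_n
        h = [last]
        for i in range(n - 1, 0, -1):                 # step 6: trace back
            h.append(ptr[i][h[-1]])
        h.reverse()
        return h, V[n-1][last]

With the Lambda-phage tables above, viterbi(T0, T, E, states, seq) segments a sequence seq into CG-rich and AT-rich stretches.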
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

The FORWARD algorithm


Given a sequence s of length n and an HMM with parameters (T, E):

1. Create a table F of size |H| × (n+1);
2. Initialize (i = 0): F(0,0) = 1; F(k,0) = 0 for k > 0;
3. For i = 1:n, compute each entry using the recursive relation:
       F(j,i) = E(j, s(i)) · Σ_k { F(k,i-1) · T(k,j) }
   (no pointers are needed: unlike Viterbi, the forward algorithm sums over all paths);
4. OUTPUT: P(s) = Σ_k { F(k,n) }.
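
The forward algorithm differs from Viterbi only in replacing max by a sum, which the code makes plain (same sketch conventions as above):

    def forward(T0, T, E, states, s):
        """Total probability P(s), summed over all state paths."""
        # F[k]: probability of emitting s[0..i] and being in state k at position i
        F = {k: T0[k] * E[k][s[0]] for k in states}
        for c in s[1:]:
            F = {j: E[j][c] * sum(F[k] * T[(k, j)] for k in states)
                 for j in states}
        return sum(F.values())

On short strings its result should match prob_s_bruteforce above, which makes a convenient sanity check.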
Introduction to Bioinformatics
LECTURE 4: HIDDEN MARKOV MODELS

The EM (Expectation Maximization) algorithm


Given a sequence s and an HMM with unknown (T, E):

1. Initialize h, E and T;
2. Given s and h, estimate E and T simply by counting emissions and state transitions;
3. Given s, E and T, estimate h, e.g. with the Viterbi algorithm;
4. Repeat steps 2 and 3 until some convergence criterion is met.
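
When step 3 uses Viterbi, this alternation is often called Viterbi training. A minimal sketch, reusing the viterbi function above; the pseudocounts of 1 and the fixed iteration count are assumptions made for the illustration, not part of the slide:

    from collections import Counter

    def viterbi_training(T0, states, alphabet, s, h, iters=10):
        """Alternate counting (step 2) and Viterbi decoding (step 3)."""
        for _ in range(iters):
            # Step 2: re-estimate E and T by counting, with pseudocount 1
            em = {k: Counter({b: 1 for b in alphabet}) for k in states}
            tr = Counter({(k, l): 1 for k in states for l in states})
            for i, (state, sym) in enumerate(zip(h, s)):
                em[state][sym] += 1
                if i > 0:
                    tr[(h[i-1], state)] += 1
            E = {k: {b: c / sum(em[k].values()) for b, c in em[k].items()}
                 for k in states}
            T = {(k, l): tr[(k, l)] / sum(tr[(k, m)] for m in states)
                 for k in states for l in states}
            # Step 3: re-estimate the hidden path h with Viterbi
            h, _ = viterbi(T0, T, E, states, s)
        return T, E, h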
EXAMPLE: finding genes with VEIL

The Viterbi Exon-Intron Locator (VEIL) was developed by John Henderson, Steven Salzberg, and Ken Fasman at Johns Hopkins University.

A gene finder with a modular structure:

* Uses an HMM made up of sub-HMMs, each describing a different part of the sequence: upstream noncoding DNA, exon, intron, ...
* Assumes the test data start and end with noncoding DNA and contain exactly one gene.
* Uses biological knowledge to hardwire parts of the HMM, e.g. start and stop codons, splice sites.
The exon sub-model

[Figure: the VEIL exon sub-HMM.]
Other sub-models

The start codon model is very simple:

    [Figure: Upstream → a → t → g → Exon]

The splice junctions are also quite simple and can be hardwired (the figure shows the 5′ splice site).
The overall model

[Figure: the complete VEIL model. Upstream → start codon → Exon → stop codon → Downstream → polyA site, with a loop Exon → 5′ splice site → intron → 3′ splice site → Exon.]

For more details, see J. Henderson, S.L. Salzberg, and K. Fasman (1997), Journal of Computational Biology 4(2), 127-141.
END of LECTURE 4
