
Probability and Markov Models

Timothy L. Bailey BIOL3014

Reading
Chapter 1 in the book. Chapter 3, pages 46-51.


Definition: Random Process


A RANDOM PROCESS is something that has a random outcome:
- Roll a die, flip a coin, roll 2 dice
- Observe an orthologous base pair in 2 sequences
- Measure an mRNA level
- Weigh a person


Definition: Experiment

In probability theory, an EXPERIMENT is a single observation of a random process.


Definition: Event
An EVENT is a set of possible outcomes of an experiment. An ELEMENTARY EVENT is whatever you decide it is: the finest-grained outcome you choose to distinguish. For example:
- The outcome of 1 roll of a die
- The outcomes of n rolls of a die
- The residue at position 237 in a protein
- The residues at position 237 in a family of proteins
- The weight of a person

Elementary events must be non-overlapping!



Compound Events
A COMPOUND EVENT is a set of one or more elementary events. For example, you might define two compound events in a die-rolling experiment: E = roll less than 3, F = roll greater than or equal to 3. Then E = {1, 2} and F = {3, 4, 5, 6}.
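As a small illustration (not on the original slide), these compound events can be written as Python sets, with probabilities obtained by counting elementary events of a fair die:

```python
# Sample space for one roll of a fair six-sided die
sample_space = {1, 2, 3, 4, 5, 6}

# Compound events: E = "roll less than 3", F = "roll greater than or equal to 3"
E = {1, 2}
F = {3, 4, 5, 6}

# Each elementary event has probability 1/6, so Pr(event) = |event| / |sample_space|
pr_E = len(E) / len(sample_space)  # 1/3
pr_F = len(F) / len(sample_space)  # 2/3
print(pr_E, pr_F, pr_E + pr_F)     # 0.333..., 0.666..., 1.0 (E and F partition the space)
```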

Defn: Sample Space


The SAMPLE SPACE is the set of all ELEMENTARY EVENTS. So the sample space is the universe of all possible outcomes of the experiment. This is written:
Ω = { Ei }

For example, for rolls of a die, you might have:


Ω = {1, 2, 3, 4, 5, 6}

Discrete vs. Continuous Events


The sample space might be INFINITE. For example, the weight of a person can be any real number greater than 0.
Some events are DISCRETE (countable):
- Base pairs, residues, die rolls
Other events are CONTINUOUS (e.g., real-valued):
- Weights, alignment scores, mRNA levels

The Axioms of Probability


Let E and F be events. Then the axioms of probability are:
1. Pr(E) ≥ 0
2. Pr(Ω) = 1
3. Pr(E ∪ F) = Pr(E) + Pr(F) if E ∩ F = ∅ (the empty set)
4. Pr(E | F) Pr(F) = Pr(E ∩ F)
[Venn diagram: probability is like area. Regions shown for E, F, E ∪ F, and E ∩ F.]
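As a quick numerical check (my own illustration, reusing the fair-die events E and F from above), the axioms can be verified directly:

```python
# Fair die: each elementary event has probability 1/6
pr = {outcome: 1 / 6 for outcome in range(1, 7)}

def prob(event):
    """Pr(event) as the sum of its elementary-event probabilities."""
    return sum(pr[o] for o in event)

E, F = {1, 2}, {3, 4, 5, 6}
omega = set(pr)  # the sample space

print(prob(E) >= 0)                                    # axiom 1: non-negativity
print(abs(prob(omega) - 1.0) < 1e-12)                  # axiom 2: Pr(Omega) = 1
print(abs(prob(E | F) - (prob(E) + prob(F))) < 1e-12)  # axiom 3: E and F are disjoint
# Axiom 4: Pr(E | F) Pr(F) = Pr(E and F); here E and F are disjoint, so both sides are 0.
print(prob(E & F) == 0.0)
```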

Notation
Joint probability: Pr(E, F)
The probability of E and F

Conditional probability: Pr(E | F)


The probability of E given F


Conditional Probability and Bayes Rule


Conditional probability can be defined as: Pr(E | F) = Pr(E, F) / Pr(F)

Bayes' Rule can be used to reverse the roles of E and F: Pr(F | E) = Pr(E | F) Pr(F) / Pr(E)
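A minimal numeric sketch of Bayes' Rule (the probabilities below are made-up example values, not taken from the slides):

```python
# Hypothetical values: F = "the site is in a CpG island", E = "a CG dinucleotide is observed"
pr_F = 0.01          # Pr(F), the prior
pr_E_given_F = 0.20  # Pr(E | F)
pr_E = 0.03          # Pr(E), the overall probability of the observation

# Bayes' Rule: Pr(F | E) = Pr(E | F) Pr(F) / Pr(E)
pr_F_given_E = pr_E_given_F * pr_F / pr_E
print(pr_F_given_E)  # ~0.067

# Consistency with the definition of conditional probability:
# Pr(E, F) = Pr(E | F) Pr(F) = Pr(F | E) Pr(E)
print(abs(pr_E_given_F * pr_F - pr_F_given_E * pr_E) < 1e-12)
```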


Sequence Models
Observed biological sequences (DNA, RNA, protein) can be thought of as the outcomes of random processes. So, it makes sense to model sequences using probabilistic models. You can think of a sequence model as a little machine that randomly generates sequences.

A Simple Sequence Model


Imagine a tetrahedral (four-sided) die with the letters A, C, G and T on its sides. You roll the die 100 times and write down the letters that come up (down, actually). This is a simple random sequence model.
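A minimal Python sketch of this model (assuming a fair die, i.e. equal emission probabilities, which the slide does not specify):

```python
import random

# Emission probabilities for the four sides of the die; a fair die is assumed here
emission = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def roll_sequence(length=100, probs=emission):
    """Generate a random DNA sequence by rolling the four-sided die `length` times."""
    letters = list(probs)
    weights = [probs[x] for x in letters]
    return "".join(random.choices(letters, weights=weights, k=length))

print(roll_sequence())  # e.g. "GATTC...": a random 100-letter sequence
```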


Zero-order Markov Model


The four-sided die model is called a 0-order Markov model. It can be drawn thus:
[Diagram: 0-order Markov sequence model. A single state with self-transition probability p = 1 (the transition probability) and emission probabilities qA, qC, qG, qT.]

Complete 0-order Markov Model


To model the length of the sequences that the model can generate, we need to add start and end states.
[Diagram: complete 0-order Markov sequence model. A start state S, a single emitting state M with emission probabilities qA, qC, qG, qT and self-transition probability p, and a transition with probability 1-p to an end state E.]

Generating a Sequence
This Markov model can generate any DNA sequence. Associated with each sequence is a path and a probability.
1. Start in state S: P = 1
2. Move to state M: P = 1 · P
3. Print letter x: P = qx · P
4. Move to state M: P = p · P, or to state E: P = (1-p) · P
5. If in state M, go to 3. If in state E, stop.
[Diagram: the complete 0-order model from the previous slide, with transition probabilities p and 1-p and emission probabilities qA, qC, qG, qT.]
Example: Sequence GCAGCT, Path S, M, M, M, M, M, M, E,
P = 1 · qG · p · qC · p · qA · p · qG · p · qC · p · qT · (1-p)
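A minimal Python sketch of this generative process (the function names and the parameter values used at the bottom are my own assumptions, not from the slides):

```python
import random

def generate(q, p):
    """Generate one sequence from the complete 0-order model.
    q: emission probabilities qA, qC, qG, qT; p: probability of staying in state M."""
    letters, weights = list(q), list(q.values())
    seq = []
    while True:
        seq.append(random.choices(letters, weights=weights)[0])  # step 3: print a letter
        if random.random() >= p:                                 # step 4: move to E with probability 1-p
            return "".join(seq)

def probability(seq, q, p):
    """Probability of generating `seq` along its (unique) path S, M, ..., M, E."""
    prob = 1.0                    # start in S and move to M with probability 1
    for x in seq:
        prob *= q[x]              # one emission per letter
    prob *= p ** (len(seq) - 1)   # M -> M transitions between emissions
    prob *= 1 - p                 # final M -> E transition
    return prob

q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform emissions
print(generate(q, p=0.9))
print(probability("GCAGCT", q, p=0.9))            # = 1 * qG*p*qC*p*qA*p*qG*p*qC*p*qT*(1-p)
```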

Using a 0-order Markov Model


This model can generate any DNA sequence, so it can be used to model DNA. We used it as the background model when we created scoring matrices for sequence alignment. It's a pretty dumb model, though. DNA is not very well modeled by a 0-order Markov model because the probability of seeing, say, a G following a C is usually different from the probability of a G following an A (e.g., in CpG islands). So we need better models: higher-order Markov models.


Markov Model Order



This simple sequence model is called a 0-order Markov model because the probability distribution of the next letter to be generated doesn't depend on any (zero) of the letters preceding it.


The Markov Property: Let X = X1 X2 … XL be a sequence. In an n-order Markov sequence model, the probability distribution of the next letter depends on the previous n letters generated.
0-order: Pr(Xi | X1 X2 … Xi-1) = Pr(Xi)
1-order: Pr(Xi | X1 X2 … Xi-1) = Pr(Xi | Xi-1)
n-order: Pr(Xi | X1 X2 … Xi-1) = Pr(Xi | Xi-1 Xi-2 … Xi-n)
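To make the difference between orders concrete, here is a minimal sketch (my own illustration with assumed parameter values): a 0-order model multiplies unconditional letter probabilities, while a 1-order model conditions each letter on the one before it.

```python
def prob_0order(seq, q):
    """0-order: Pr(X) = product of Pr(Xi), independent of context."""
    p = 1.0
    for x in seq:
        p *= q[x]
    return p

def prob_1order(seq, q_start, trans):
    """1-order: Pr(X) = Pr(X1) * product of Pr(Xi | Xi-1)."""
    p = q_start[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]  # Pr(cur | prev)
    return p

# Assumed example parameters (not from the slides)
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
trans = {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"}
trans["C"] = {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2}  # e.g. G is more likely after C

print(prob_0order("CGCG", q))         # 0.25**4
print(prob_1order("CGCG", q, trans))  # larger, because of the C -> G bias
```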

A 1-order Markov Sequence Model


In a first-order Markov sequence model, the probability of the next letter depends on the previous letter generated. We can model this by making a state for each letter. Each state always emits the letter it is labeled with. (Not all transitions are shown in the diagram.)
[Diagram: 1-order Markov sequence model. A start state S, one state per letter (A, C, G, T), and an end state E; transitions are labeled with conditional probabilities such as Pr(A|A), Pr(T|A), Pr(C|G), Pr(C|C), Pr(G|G), Pr(T|T).]
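A minimal generative sketch of such a first-order model (the start, transition, and end probabilities below are assumed example values, not from the slides):

```python
import random

# Assumed parameters: Pr(first letter) leaving the start state S,
# Pr(next letter | current letter), and Pr(end | current letter)
start = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
trans = {a: {b: 0.22 for b in "ACGT"} for a in "ACGT"}  # Pr(b | a); each row sums to 0.88
end = {a: 0.12 for a in "ACGT"}                         # Pr(E | a); with trans, each row totals 1.0

def generate_1order():
    """Walk the states S -> letter states -> E, emitting each state's label."""
    seq = [random.choices(list(start), weights=list(start.values()))[0]]
    while True:
        cur = seq[-1]
        if random.random() < end[cur]:  # transition to the end state E
            return "".join(seq)
        nxt = random.choices("ACGT", weights=[trans[cur][b] for b in "ACGT"])[0]
        seq.append(nxt)

print(generate_1order())
```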

A 2-order Markov Model


To make a second-order Markov sequence model, each state is labeled with two letters. It emits the second letter in its label. There would have to be sixteen states: AA, AC, AG, AT, CA, CC, CG, CT, and so on, plus four states for the first letter in the sequence: A, C, G, T. Each two-letter state would have transitions only to states whose first letter matches its own second letter.
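A small sketch (my own illustration) that enumerates these states and the transitions each one allows:

```python
from itertools import product

letters = "ACGT"
first_letter_states = list(letters)                                      # A, C, G, T
dinucleotide_states = ["".join(p) for p in product(letters, repeat=2)]   # AA, AC, ..., TT (16 states)

def allowed_transitions(state):
    """A state labeled 'xy' (or 'y' for a first-letter state) may only move to
    states whose first letter equals y, i.e. 'yA', 'yC', 'yG', 'yT'."""
    last = state[-1]
    return [s for s in dinucleotide_states if s[0] == last]

print(len(dinucleotide_states))   # 16
print(allowed_transitions("AC"))  # ['CA', 'CC', 'CG', 'CT']
print(allowed_transitions("G"))   # ['GA', 'GC', 'GG', 'GT']
```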


Part of a 2-order Model


Each state remembers, in its label, what the previously emitted letter was.
[Diagram: part of a 2-order model. State AA has transitions to states AA, AC, AG, AT (labeled with probabilities such as Pr(A|AA), Pr(G|AA), Pr(T|AA)) and to the end state E.]
