You are on page 1of 97

Gene prediction in prokaryotes!

Dmitrij Frishman
Technische Universitt Mnchen
Gene prediction: a mind map!
Prokarya!
From DeRisi et al., 1997
>gb|L43967|MGEN Mycoplasma genitalium G37 complete genome
TAAGTTATTATTTAGTTAATACTTTTAACAATATTATTAAGGTATTTAAAAAATACTATT
ATAGTATTTAACATAGTTAAATACCTTCCTTAATACTGTTAAATTATATTCAATCAATAC
ATATATAATATTATTAAAATACTTGATAAGTATTATTTAGATATTAGACAAATACTAATT
TTATATTGCTTTAATACTTAATAAATACTACTTATGTATTAAGTAAATATTACTGTAATA
CTAATAACAATATTATTACAATATGCTAGAATAATATTGCTAGTATCAATAATTACTAAT
ATAGTATTAGGAAAATACCATAATAATATTTCTACATAATACTAAGTTAATACTATGTGT
AGAATAATAAATAATCAGATTAAAAAAATTTTATTTATCTGAAACATATTTAATCAATTG
AACTGATTATTTTCAGCAGTAATAATTACATATGTACATAGTACATATGTAAAATATCAT
TAATTTCTGTTATATATAATAGTATCTATTTTAGAGAGTATTAATTATTACTATAATTAA
GCATTTATGCTTAATTATAAGCTTTTTATGAACAAAATTATAGACATTTTAGTTCTTATA
ATAAATAATAGATATTAAAGAAAATAAAAAAATAGAAATAAATATCATAACCCTTGATAA
CCCAGAAATTAATACTTAATCAAAAATGAAAATATTAATTAATAAAAGTGAATTGAATAA
AATTTTGGGAAAAAATGAATAACGTTATTATTTCCAATAACAAAATAAAACCACATCATT
CATATTTTTTAATAGAGGCAAAAGAAAAAGAAATAAACTTTTATGCTAACAATGAATACT
TTTCTGTCAAATGTAATTTAAATAAAAATATTGATATTCTTGAACAAGGCTCCTTAATTG
TTAAAGGAAAAATTTTTAACGATCTTATTAATGGCATAAAAGAAGAGATTATTACTATTC
AAGAAAAAGATCAAACACTTTTGGTTAAAACAAAAAAAACAAGTATTAATTTAAACACAA
TTAATGTGAATGAATTTCCAAGAATAAGGTTTAATGAAAAAAACGATTTAAGTGAATTTA
ATCAATTCAAAATAAATTATTCACTTTTAGTAAAAGGCATTAAAAAAATTTTTCACTCAG
TTTCAAATAATCGTGAAATATCTTCTAAATTTAATGGAGTAAATTTCAATGGATCCAATG
GAAAAGAAATATTTTTAGAAGCTTCTGACACTTATAAACTATCTGTTTTTGAGATAAAGC
AAGAAACAGAACCATTTGATTTCATTTTGGAGAGTAATTTACTTAGTTTCATTAATTCTT
TTAATCCTGAAGAAGATAAATCTATTGTTTTTTATTACAGAAAAGATAATAAAGATAGCT
TTAGTACAGAAATGTTGATTTCAATGGATAACTTTATGATTAGTTACACATCGGTTAATG
AAAAATTTCCAGAGGTAAACTACTTTTTTGAATTTGAACCTGAAACTAAAATAGTTGTTC
AAAAAAATGAATTAAAAGATGCACTTCAAAGAATTCAAACTTTGGCTCAAAATGAAAGAA
CTTTTTTATGCGATATGCAAATTAACAGTTCTGAATTAAAAATAAGAGCTATTGTTAATA
ATATCGGAAATTCTCTTGAGGAAATTTCTTGTCTTAAATTTGAAGGTTATAAACTTAATA
TTTCTTTTAACCCAAGTTCTCTATTAGATCACATAGAGTCTTTTGAATCAAATGAAATAA
ATTTTGATTTCCAAGGAAATAGTAAGTATTTTTTGATAACCTCTAAAAGTGAACCTGAAC
TTAAGCAAATATTGGTTCCTTCAAGATAATGAATCTTTACGATCTTTTAGAACTACCAAC
TACAGCATCAATAAAAGAAATAAAAATTGCTTATAAAAGATTAGCAAAGCGTTATCACCC
TGATGTAAATAAATTAGGTTCGCAAACTTTTGTTGAAATTAATAATGCTTATTCAATATT
AAGTGATCCTAACCAAAAGGAAAAATATGATTCAATGCTGAAAGTTAATGATTTTCAAAA
TCGCATCAAAAATTTAGATATTAGTGTTAGATGACATGAAAATTTCATGGAAGAACTCGA
ACTTCGTAAGACCTGAGAATTTGATTTTTTTTCATCTGATGAAGATTTCTTTTATTCTCC
From DNA sequence... ...to structure and function
of gene products
Linear representation of the Buchnera sp.!
Circular genome map showing the position and orientation of
known genes, pseudogenes and repetitive sequences!
From the outside:
!circles 1 and 2 (clockwise and
anticlockwise) genes on the - and +
strands, respectively
!circles 3 and 4, pseudogenes
!5 and 6, M. leprae specific genes
!7, repeat sequences
!8, G+C content
!9, G/C bias (G+C)/(G-C)
Genes are color coded using the following
functional categories: lipid metabolism
(dark grey); intermediary metabolism and
respiration (yellow); information pathways
(red); regulatory proteins (light blue);
conserved hypothetical proteins (orange);
proteins of unknown function (light green);
insertion sequences and phage related
functions (pink); stable RNAs (dark blue);
cell wall and cell processes (green); PE
and PPE protein families (magenta);
virulence, detoxification, adaptation
(brown).
Cole et al., 2001
Gene prediction owchart!
Obtain new
genomic
DNA
sequence
1. Translate in all six
reading frames and
compare to proten
sequence database
2. Perform database
similarity search of
expressed sequence
tag (EST) database
of same organism, or
cDNA sequences if
available
Use gene
prediction
program to
locate genes
Analyze
regulatory
sequence in
the gene
D. Mount, 2001
In bacteria!
Chris Burge
Gene Expression I: Transcription!
Ribosome!
!! Translation is handled by a molecular complex,
ribosome, which consists of both proteins &
ribosomal RNA (rRNA)!
!! Ribosome reads mRNA & the translation starts at a
start codon (the translation start site)!
!! With help of tRNA, each codon is translated to an
amino acid!
!! Translation stops once ribosome reads a stop codon
(the translation stop site)!
The Genetic Code!
The structure of a bacterial gene. Single gene
transcript.!
ATG
TAA
transcript
Coding sequence
Transcription start site
Promoter
Ribosome binding site/translational start (ATG)
hisG
Translational end
Transcriptional terminator
ATG
TAA
transcript
Transcription start site
hisG
Translational end
hisH
Multigene bacterial operon
-One promoter, one transcriptional stop; multiple translational starts and stops
TAA
Promoters
!! Promoters are DNA segments upstream of
transcripts that initiate transcription
!! Promoter attracts RNA Polymerase to the
transcription start site
5
Promoter
3
www.bioalgorithms.info
Promoter Structure in Prokaryotes (E.coli)!
Transcription starts
at offset 0.
! Pribnow Box (-10)
! Gilbert Box (-30)
! Ribosomal
Binding Site (+10)
www.bioalgorithms.info
Figure 13.14 Genomes 3 ( Garland Science 2007)
The Shine-Dalgarno sequence!
!! The ribosome binds to the
messenger RNA through
baseparing to the 30S ribosomal
subunit.!
!! The binding site is the Shine-
Dalgarno sequence (SD).!
!! The SD is a purine-rich sequence
(consensus sequence: AGGAG) at
the 5' end of most prokaryotic
mRNAs. !
!! The SD is found 5-10 basepairs
upstream from the start codon.!
Ribosome binding sites!
A logo of the ribosome-binding sites and start codon in E.coli genes
Probability distribution of spacer
length between RBS and the gene
start
Besemer et al., 2001
>AE006641
GTATACTCTTCTTCCCTATACATTGTCGCAGCAAGCTTAGTTTCTTTAGCCTCTCTGCTTTCATTATTAC
TTATAATCTTAATAGCAAGGAGACATATGATAGAGTATTTCTATATGATTCCTTCGTTCGTTTATATGAA
CTTTATTGTCGCACTAAACTTCACTGCAATATTTTTAGAGTTAATAAGAGCACCTAGAGTGTGGGTAAAA
ACTGAAAGAAGTGCCAAGGTTACGGGGGAGGTCATGGGATGATAACTGAATTTTTACTTAAAAAGAAATT
AGAAGAACATTTAAGCCATGTAAAGGAAGAGAATACGATATATGTAACAGATTTAGTAAGATGCCCCAGA
AGAGTAAGATATGAGAGTGAATACAAGGAGCTTGCAATCTCTCAGGTTTACGCGCCTTCAGCTATTTTAG
GGGACATATTGCATCTCGGTCTTGAAAGCGTATTAAAAGGGAACTTTAATGCAGAAACTGAAGTTGAAAC
TCTGAGAGAAATTAACGTCGGAGGTAAAGTTTATAAAATTAAAGGAAGAGCCGATGCAATAATTAGAAAT
GACAACGGGAAGAGTATTGTAATTGAGATAAAAACTTCTAGAAGTGATAAAGGATTACCTCTAATTCATC
ATAAAATGCAGCTACAGATATATTTATGGTTATTTAGTGCAGAAAAAGGTATACTAGTTTACATAACTCC
AGATAGGATAGCTGAGTATGAAATAAACGAACCTTTAGATGAAGCAACAATAGTAAGACTTGCAGAGGAT
ACAATAATGTTACAAAACTCACCTAGATTCAACTGGGAATGTAAATATTGCATATTTTCCGTCATTTGCC
CAGCTAAACTAACCTAAAATTAAAATCTCTCATCGATATAATTAAATTGTGCACACTAGACCAGTAGTTG
CCACAATAGCTGGGAGTGACAGTGGAGGAGGTGCTGGATTACAGGCTGATCTAAAGACGTTTAGCGCATT
AGGAGTTTTTGGTACAACAATAATAACCGGTTTAACAGCACAGAATACAAGAACAGTTACAAAAGTATTA
GAGATACCATTAGATTTCATTGAAGCTCAGTTTGATGCGGTTTGCCTAGATTTACATCCAACTCACGCCA
AAACTGGAATGTTAGCTTCTGGTAAAGTGGTAGAACTTGTACTGAGAAAAATTAGAGAGTATAACATAAA
ACTAGTTTTAGATCCAGTGATGGTTGCGAAATCTGGATCATTATTGGTAACAGAGGATATCTCGGAGCAA
ATAAAAAAGGCGATGAAGGAGGCCATAATATCTACTCCAAACAGATATGAAGCTGAGATAATAAATAAGA
CAAAGATTAATAGTCAAGATGATGTTATAAAAGCGGCAAGGGAAATTTATTCTAAGTATGGGAATGTTGT
AGTTAAAGGATTTAATGGAGTAGATTACGCCATAATTGACGGAGAAGAAATAGAGTTAAAAGGTGATTAC
ATCAGTACTAAAAATACACATGGTAGTGGAGACGTATTTTCTGCCTCCATAACTGCATATCTTGCCTTGG
GATACAAACTTAAAGATGCATTAATAAGAGCTAAAAAATTCGCTACAATGACAGTCAAATACGGTTTGGA
CTTAGGAGGAGGATATGGACCAGTAGATCCCTTTGCCCCTATAGAGTCCATAGTGAAGAGAGAAGAAGGA
AGAAATCAGCTAGAAAACTTACTTTGGTACTTAGAGTCTAATCTTAACGTTATACTTAAACTAATTAACG
Can you spot the gene?!
/ (ATG|TTG|GTG)(()*?)(TGA|TAG|TAA)/
Bacterial gene nding: relatively easy!
!! Dense Genomes!
!! Short intergenic regions!
!! Uninterrupted ORFs!
!! Conserved signals!
!! Abundant comparative information!
!! Complete genomes!
Chuck Stuben, 2005
Indicators of protein coding regions in bacterial DNA!
!! Intrinsic evidence!
"! Sufcient ORF length. Long ORFs rarely occur by chance.!
"! Specic patterns of codon usage that are different from triplet frequencies in
non-coding regions ("coding potential").!
"! The presence of ribosome binding sites (RBS) in the (-20)...(-1) region
upstream of the start codon that help to direct ribosomes to the correct
translation start positions. A part of the RBS is formed by the purine-rich
Shine-Dalgarno (SD) sequence which is complementary to the 3 end of the
16S rRNA. !
!! Extrinsic evidence!
"! Similarity to known, especially experimentally characterized, gene products!
Gene-Finding Strategies!
Intrinsic approaches!
!! GeneMark (Borodovsky & McIninch, 1993)!
"! non-homogeneous Markov models for DNA regions that code for proteins or are complementary to
them!
"! homogeneous Markov models for non-coding regions !
"! coding capacity of sliding windows is deduced through A Bayesian decision rule!
!! GLIMMER (Salzberg et al., 1998)!
"! interpolated Markov models!
"! takes into account DNA oligomers of varying length dependent on the local composition of the
sequence!
!! EcoParse (Krohg et al., 1994)!
"! hidden Markov models!
"! nds the maximum likelihood parse of a DNA sequence into coding and non-coding regions !
"! no sliding windows used!
!! Hatzigeorgiou and Fickett !
"! A program for gene recognition relying solely on ORF length and RBS!
!! FgenesB (Softberry, Inc.)!
Figure 5.1 Genomes 3 ( Garland Science 2007)
Six Frames in a DNA Sequence
!! stop codons !
!! start codons!
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
www.bioalgorithms.info
Figure 5.2 Genomes 3 ( Garland Science 2007)
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/
wprintgc.cgi?mode=c!
Open Reading Frames (ORFs)
!! Detect potential coding regions by looking at ORFs!
"! A genome of length nis comprised of (n/3) codons!
"! Stop codons break genome into segments between consecutive Stop codons!
"! The subsegments of these that start from the Start codon (ATG) are ORFs!
#! ORFs in different frames may overlap
3
n
3
n
3
n
Genomic Sequence
Open reading frame
ATG TGA
www.bioalgorithms.info
Long vs.Short ORFs
!! Long open reading frames may be a gene
"! At random, we should expect one stop codon every
(64/3) ~= 21 codons
"! However, genes are usually much longer than this
!! A basic approach is to scan for ORFs whose length
exceeds certain threshold
"! This is nave because some genes (e.g. some neural
and immune system genes) are relatively short
www.bioalgorithms.info
Open reading frames of "100 bp encoded on a 10-kb fragment of the Escherichia coli
K12 genome from 3435250 to 3445250!
The six horizontal lines represent frames 1, 2, 3, !1, !2, and !3, respectively.
Koonin & Galperin, 2003
Russell F. Doolittle, 2002
Estimated over-annotation of genes in sequenced
genomes!
Skovgaard et al., 2001
Protein length distributions for Escherichia coli
Skovgaard et al., 2001
Protein length distributions for Aeropyrum
pernix
Skovgaard et al., 2001
Besemer et al., 2001
Intrinsic approach!
Statictical properties of protein coding regions!
!! Factors that contribute to the unequal usage of codons in a
coding sequence:!
"! Unequal use of amino acids!
"! Unequal number of codons for different amino acids!
"! Codon preference!
!! If reading frame 1 encodes a protein, it will inuence the
following factors:!
"! The amino acid composition in both the coding frame and the other two
frames (frames 2 and 3)!
"! The codon composition of all three frames!
"! Positional base frequency: the frequency with which each of the four
bases occupies each of the three positions within codons!
Staden, 1990
The Genetic Code!
SwissProt protein sequence database!
Staden, 1990
Staden, 1990
Staden, 1990
Examining positional base frequency!
!! For each base measure the difference in its abundance in each
position!
"! E.g., relative abundance of base T in positions 1,2,3 of codons!
"! The greater difference, the more likely that the sequence is coding!
"! Uneven positional base frequencies method!
!! Study preferences for certain bases to occupy particular positions
in codons!
"! Compare the observed positional base frequencies with an expected
distribution!
"! Positional base preference method !
Staden, 1990
Uneven positional base frequencies method
(Fickett, 1982)!
!! N
ij
: the number of times base i occurs
in a position jof a codon!
!! E
ij
=(N
i1
+N
i2
+N
i3
): expected value for
each base in each position of a codon,
assuming an even distribution!
!! D=!|E
ij
-
N
ij
| : divergence from even
positional usage
!
!! Slide a window along the sequence,
one codon at a time!
!! Calculate D for each position in which
a window is placed!
Coding
Noncoding
EMBL database, 1984
Staden, 1990
Staden, 1990
Codon usage table!
Source: http://www.kazusa.or.jp/codon/
Source: http://www.kazusa.or.jp/codon/
TESTCODE (Fickett, 1982)!
!! In protein coding regions every third base tends to be the same one much more often than by chance alone due
to non-random use of codons!
!! Method:!
" Count the number of each base at every third position starting at positions 1,2, and 3, and going to the end of the sequence
window!
" The assymetry statistic for each base is calculated as the ratio of the maximum count of the three possible reading frames
devided by the minimum count for the same base plus 1!
" The frequency of each base is the window is also calcutated!
" The resulting assymetry and frequency scores are converted to probabilities of being found in a coding region!
Plot of the nonrandomness of every third base in the sequence
Coding
Noncoding
D.Mount, 2001
HMM for gene nding!
!! HMMs are able to model grammar!
!! Many problems in biological sequence analysis have a
grammatical structure!
!! Example: the grammar of eukaryotic gene structure (simplied)!
"! Exons and introns are words!
"! Sentences: exon-intron-exon-intron..intron-exon!
"! Sentences can never end with an intron!
"! Exon can never follow an exon without an intron in between!
"! Other constraints!
Krogh, 1998
HMM architecture for a parser for E.coliDNA with a simple
intergenic model!
Central state
!Generates no nucleotides
!Connects all models
Main states
Insertions
Deletions
Generate no nucleotides
The thickness of the arrows
indicates the fraction of sequences
making a given transition.
Insert state
!Produces random sequences from the
base distribution of intergenic regions in
the training set
Krogh, 1994
How HMM generates a sequence of nucleotides!
!! Process: random walk starting in the middle of any of the HMMs. !
"! begin at any state (e.g. central) and enter any of the rings!
"! each such state transition has an associated probability!
"! transitions out of the central state are chosen at random according to these probabilities (they sum to one)!
#! Example: a transition leading to the AAC codon model HMM generates the three nucleotides AAC with very high
probability and then, with probability 1, makes the transition back to the central state.!
"! subsequently, a new transition out of the central state is selected randomly and independently of the
previous transition!
!! Choosing one of the 61 codon models repeatedly results in a 'random gene!
!! Gene termination!
"! entry into one of the rings below the central state!
"! low probability (roughly: the number of intergenic regions divided by the number of codons in a typical
contig of E.coli DNA.)!
"! one stop codon HMM generates both TAA and TGA, each according to its frequency of occurrence in E.
coli, and the other TAG!
!! Intergenic region!
"! produced independently and at random by looping in the state labelled 'Intergene model!
!! Start codon HMM!
"! generates either ATG, GTG or TTG, each with the appropriate probability (TTG is very rare in E. coli)!
!! A transition is made back to the central state and the whole process repeated !
!! Result: a sequence of nucleotides that is statistically similar to a contig of E. coliDNA consisting
of a collection of genes interspersed with intergenic regions.!
Krogh, 1994
Finding the most likely random walk!
!! A dynamic programming method known as the Viterbi algorithm
!! Generates a parse of the contig: labels genes in the DNA by
identifying portions of the path that begin with the start codon at
the end of the intergenic ring, pass through several amino acid
codon HMMs, and return to one of the stop codons at the
beginning of the intergenic ring!
!! The model parses a gene in one direction only and thus nds all
genes on the direct strand!
!! To locate genes on the opposite strand, the reverse complement
is parsed!
Krogh, 1994
The gene model!
!! The role of the codon HMMs is similar to codon usage statistics!
!! Simplication: the codons in a gene are random and independent!
!! The probability that a region is coding is simply the product of the probabilities of
the individual codons!
!! The probability of an open reading frame (ORF) consisting of codons c
l
, c
2
,...c
k
and
excluding start and stop codons is!
Codon probability
ORF gene index
Contig length
Krogh, 1994
Krogh, 1994
Gene index in E.coli
!! Typical genes have I<1!
!! Average(I) = 0.935!
!! 16% greater than 0.96!
!! 7% greater than 0.98!
!! 2% greater than 1.0!
!! Used to rank predictions and
resolve ambiguities!
Krogh, 1994
A parser with a complex intergenic model!
Krogh, 1994
Traning data!
Krogh, 1994
Krogh, 1994
Modeling inter-codon dependencies:
GeneMark!
!! Fifth-order Markov model (Lukashin and Borodovsky
1998)!
"! uses sequence information from the previous ve bases!
"! the frequency of hexamers is used to differentiate between
coding and noncoding sequences!
"! limitation: there must be many representatives of each
hexameric sequence in genes!
D.Mount, 2001
GeneMark!
!! Uses a Markov chain model to represent the statistics of coding and noncoding reading
frames!
!! Uses dicodon statistics to identify coding regions!
! Fifth-order Markov chains consist of terms P(a|x
1
x
2
x
3
x
4
x
5
) which represent the probability of the sixth base of sequence x being
a given that the previous five bases were x
1
x
2
x
3
x
4
x
5
.
! These terms must be defined for all possible pentamers with the general sequence b
1
b
2
b
3
b
4
b
5
(training data)
! When there are sufficient data for good statistics:

P(ab
1
b
2
b
3
b
4
b
5
)=
n
b
1
b
2
b
3
b
4
b
5
a
n
b
1
b
2
b
3
b
4
b
5
a
a=A,C,G,T
"
Where

n
b
1
b
2
b
3
b
4
b
5
ais the number of times the sequence b
1
b
2
b
3
b
4
b
5
a occurs in the training data
! GeneMark assumes each coding reading frame has unique dicodon statistics, labeled P
1
(a|b
1
b
2
b
3
b
4
b
5
), P
2
(a|b
1
b
2
b
3
b
4
b
5
), P
6
(a|
b
1
b
2
b
3
b
4
b
5
). Note: reading frames are identified by the codon position of the last base of the pentamer!
! Non-coding regions are assumed to have the same statistics in all six reading frames P
nc
(a|b
1
b
2
b
3
b
4
b
5
).
! These sets of parameters form seven distinct models, which can be used to estimate the likelihood that any given sequence is
coding or noncoding, and if coding, which reading frame is involved.
The probability of obtaining a sequence x=x
1
x
2
x
3
x
4
x
5
x
6
x
7
x
8
x
9
in a coding region in the translated reading frame 3 (i.e. x
1
x
2
x
3
is a
translated codon):
P(x|3) = P
2
(x
1
x
2
x
3
x
4
x
5
)P
2
(x
6
|x
1
x
2
x
3
x
4
x
5
)P
3
(x
7
|x
2
x
3
x
4
x
5
x
6
)P
1
(x
8
|x
3
x
4
x
5
x
6
x
7
)P
2
(x
9
|x
4
x
5
x
6
x
7
x
8
)
where the first term is the probability of finding the pentamer x
1
x
2
x
3
x
4
x
5
in the coding frame 2, which will result in x
9
being in
reading frame 3.
Note the cycling of successive terms through P
1
(a|), P
2
(a|), and P
3
(a|). Such models are called periodic, phased, or
inhomogeneous Markov models.
The probability of obtaining the sequence x on the complementary strand to the coding sequence can be obtained using cycling
terms P
4
(a|), P
5
(a|), and P
6
(a|).
Using the Bayes formula, the likelihood that the segment of sequence x is in coding frame 3:

P3x
( )
=
P(x3)P3 ()
P(xnc)Pnc ( )+ P(xm)Pm ()
m=1
6
"
Where P(3) and P(nc) are the a priori probabilities of the coding frame 3 and noncoding models, respectively, and m is the index
of the coding frame.
In GeneMark P(nc) the a priori probability of the sequence being noncoding was assumed to be !, and P(1)-P(6) assumed all
to be 1/12. Sliding windows of 96 nucleotides were scored in steps of 12 nucleotides. If P(i|x), where i=1,6 exceeds a chosen
threshold, the window is predicted to be in coding frame i.
Different types of markov models that output nucleotide sequence!
Homogeneous fifth-order Markov model, with
the five states i-5 to i-1 generating state i. Each
state corresponds to a nucleotide.
Three-periodic fifth-order Markov models, each
modeling a different DNA reading frame. The
probabilities are dependent on the position of the
base within the codon. Each state is labeled with
the codon position of the represented base
GeneMark text output!
GeneMark: graphical output!
Gene classes in Escherichia coli
!! Three classes: class I, II, and III!
"! differ not only in the statistical but in the biological sense!
!! Class I genes!
"! intermediate codon usage bias!
"! maintain a low or intermediate level of expression,!
"! some genes may occasionally be expressed at a very high level in environmentally triggered
(rare) conditions!
!! Class II genes!
"! high codon usage bias!
"! highly expressed under exponential growth conditions!
!! Class III genes,!
"! low codon usage bias!
"! mainly belong to plasmids and insertion sequences!
"! also includes genes coding for mbriae, major pili, many membrane proteins, restriction
endonucleases and lambdoid phage lysogeny control proteins!
"! can be expressed at a fairly high level!
Mdigue et al., 1991
Philippe & Douady, 2003
Dealing with different gene pools!
!! Protein-coding sequences in a bacterial genome may not be homogeneous in their
compositional features!
!! Program using models trained on the bulk set of protein-coding sequences will be
insensitive in nding genes of minor inhomogeneity classes!
!! If preliminary information on gene classes is available, class-specic models of
protein- coding regions improve the performance of the gene-nding method!
!! Possible solution: !
"! use a set of long ORFs to obtain parameters of Markov models of protein-coding and
noncoding regions !
"! initial models used to score the putative gene sequences and to form the cluster seeds for the
class-specic training sets.!
"! run clusterization procedure until convergence!
"! obtain several sets of presumably coding sequences with more homogeneous compositional
features!
Hayes & Borodovsky, 1998
Hidden Markov model of a prokaryotic nucleotide sequence
used in the GeneMark.hmm algorithm!
Lukashin and Mark Borodovsky, 1998
Modeling inter-codon dependencies:
Glimmer!
!! Interpolated Markov model (Salzberg et al. 1998)!
"! nds a sufcient number of patterns by searching for the longest possible
patterns that are represented in the known gene sequences up to a length
of eight bases. !
"! if there are not enough hexameric sequences, then pentamers or smaller
may be more highly represented!
"! in other cases many representative patterns even longer than six bases
may be found !
"! the longer the patterns, the more accurate the prediction!
"! combines probability estimates from the different-sized patterns, giving
emphasis to longer patterns and weighting more heavily the patterns that
are well represented in the training sequences!
A case for the extrinsic approach!
!! In real life putative coding regions predicted by
intrinsic methods are veried by similarity searches!
!! 60-80% of genes in newly sequenced organisms have
known counterparts in other species. In at least 30% of
the cases reliable global alignments with well
characterized proteins can be obtained!
ORPHEUS: Gene prediction in bacterial
genomes!
!! Incorporation of extrinsic and intrinsic information!
!! Consideration of coding potential and ribosome-
binding sites!
!! High accuracy!
!! Precise identication of gene starts!
!! Completely automatic prediction!
Frishman et al.,1998
Outline of the algorithm!
Step I Step II Step III Step IV
High-stringency
DNA-protein search
(DPS, Huang, 1996)
DPS local
alignments
Contig sequence
Extend local alignments
Upstream: first start codon
Downstream: stop codon
Codon usage
Similarity-based
seed ORFs
Contig sequence
DPS alignments
Extend local alignments
Upstream: first start codon
Downstream: stop codon
Similarity-based and
ab initio ORFs with a
single possible start
Find other ORFs
with high coding potential
Alignment of -20-1 regions
upstream from the start codon
RBS detection
RBS weight matrix
Contig sequence
DPS alignments
Codon usage
Extend local alignments
Upstream: first start codon
Downstream: stop codon
Contig sequence
DPS alignments
Codon usage
RBS weight matrix
Find other ORFs
with high coding potential
Full set of
predicted genes
Extend all ORFs upstream
till the best start is found
Example DPS output!
Outline of the algorithm!
Step I Step II Step III Step IV
High-stringency
DNA-protein search
(DPS, Huang, 1996)
DPS local
alignments
Contig sequence
Extend local alignments
Upstream: first start codon
Downstream: stop codon
Codon usage
Similarity-based
seed ORFs
Contig sequence
DPS alignments
Extend local alignments
Upstream: first start codon
Downstream: stop codon
Similarity-based and
ab initio ORFs with a
single possible start
Find other ORFs
with high coding potential
Alignment of -20-1 regions
upstream from the start codon
RBS detection
RBS weight matrix
Contig sequence
DPS alignments
Codon usage
Extend local alignments
Upstream: first start codon
Downstream: stop codon
Contig sequence
DPS alignments
Codon usage
RBS weight matrix
Find other ORFs
with high coding potential
Full set of
predicted genes
Extend all ORFs upstream
till the best start is found
Calculation of the coding potential!
Primary coding potential
W(abc) = log F(abc), where F(abc) is its genomic frequency
Statistical weight of the codon abc
Normalized coding potential
Average codon weight
Standard deviation
Coding quality of a DNA fragment
G(b) genomic frequency of the base b
To avoid the influence of the local base composition and gene shadows


Q(a
1
b
1
c
1
a
n
b
n
c
n
)= W(a
k
b
k
c
k
)
k=1
n
"

R=
Q"n
# n

= G(b
1
)G(b
2
)G(b
3
)W(b
1
b
2
b
3
)
b
1
b
2
b
3
=AAA
TTT
"

"
2
= G(b
1
)G(b
2
)G(b
3
)[W(b
1
b
2
b
3
)#]
2
b
1
b
2
b
3
=AAA
TTT
$


"(a
1
b
1
c
1
a
n
b
n
c
n
)=R(a
1
b
1
c
1
a
n
b
n
c
n
)#max{R(c
0
a
1
b
1
c
n#1
a
n
b
n
),R(b
1
c
1
a
2
b
n
c
n
a
n+1
)}
Outline of the algorithm!
Step I Step II Step III Step IV
High-stringency
DNA-protein search
(DPS, Huang, 1996)
DPS local
alignments
Contig sequence
Extend local alignments
Upstream: first start codon
Downstream: stop codon
Codon usage
Similarity-based
seed ORFs
Contig sequence
DPS alignments
Extend local alignments
Upstream: first start codon
Downstream: stop codon
Similarity-based and
ab initio ORFs with a
single possible start
Find other ORFs
with high coding potential
Alignment of -20-1 regions
upstream from the start codon
RBS detection
RBS weight matrix
Contig sequence
DPS alignments
Codon usage
Extend local alignments
Upstream: first start codon
Downstream: stop codon
Contig sequence
DPS alignments
Codon usage
RBS weight matrix
Find other ORFs
with high coding potential
Full set of
predicted genes
Extend all ORFs upstream
till the best start is found
Alignment of the -20-1 gene regions!
1 AGAAAACGACAAGGAAAGGTATTTG
2 ATCTGAAGGGGGATTTTGGAGAATG
3 GTGAAAAATTGGAGGGAAACTCATG
4 CAATTAGAGAGGAGAATTCGATATG
5 AAAGCGAAGGACTCGGCGAGTAATG
6 ACATACCCTGCAAGGATGATTAATG
7 AAAAGGAAGGGAGGTCTATCTCATG
8 TTAAATATGGTGGTGGAAACAGATG
9 TTTCGAAAGGATTGTTTATAAAATG
10 GACTGAATATTTAAACAATTATGTG
11 TAAATAGAAGGAGGCGCACAAAATG

...

376 ACCGTTATGGAGGGATACATAAATG
377 TAACTGAATTTAAAGGAGGTTCATG
378 CGTTTAAAATGCATAATAAGGAGTG
379 TGGAAAACAGGGAGAGATCATAATG
380 TGCAAACTAACGGGGGGATAATATG
381 AAAAGAAAAAGGAGATGGGAGTATG
382 ATACAAGGTCTTTCGGGAGGCCTTG
383 ATCATGATCGGTCATATTTTAGATG
384 CGAATGTAAACATGTAGCAAGGGTG
385 AAATTCGGGAGAGTGAAGCGAGATG
386 ACAAACAGAATTCAGGTGAGACATG
L is the expected length of the Shine-Dalgarno box
Positional information content
F(b,j) - positional nucleotide frequencies
in the initial alignment
G(B) - genomic frequency of the base b
Initial position of the RBS signal

H(j)= F(b,j)logF(b,j)G(b)
( )
b=A
T
"

H(k)
j="20..."L
# $ # # # # # # max
k=j
j+L"1
%
The RBS weight matrix!
Positional nucleotide weights
Where maxN(B,k) is the frequency of the consensus nucleotide

W(b,k)=log
N(b,k)+0.5
max
b
N(b,k)+0.5
"
#
$
$
$
%
&
'
'
'
Nucleotide Nucleotide position in the window
1 2 3 4 5 6
A 0.000 -0.909 -0.845 0.000 -0.909 -0.511
C -0.856 -1.000 -0.943 -0.923 -0.999 -0.748
G -0.804 0.000 0.000 -0.709 0.000 0.000
T -0.765 -0.897 -0.980 -0.868 -0.962 0.725
Consensus A G G A G G

RBS weight matrix for B.subtilis!
Nucleotide Nucleotide position in the window
1 2 3 4 5 6
A 0.000 -0.909 -0.845 0.000 -0.909 -0.511
C -0.856 -1.000 -0.943 -0.923 -0.999 -0.748
G -0.804 0.000 0.000 -0.709 0.000 0.000
T -0.765 -0.897 -0.980 -0.868 -0.962 0.725
Consensus A G G A G G

Finding the RBS positions in each sequence!
The Shine-Dalgarno signal score
1 AGAAAACGACAAGGAAAGGTATTTG -1.763
2 ATCTGAAGGGGGATTTTGGAGAATG -0.965
3 GTGAAAAATTGGAGGGAAACTCATG -0.765
4 CAATTAGAGAGGAGAATTCGATATG -0.511
5 AAAGCGAAGGACTCGGCGAGTAATG -1.884
6 ACATACCCTGCAAGGATGATTAATG -1.264
7 AAAAGGAAGGGAGGTCTATCTCATG -0.914
8 TTAAATATGGTGGTGGAAACAGATG -1.793
9 TTTCGAAAGGATTGTTTATAAAATG -1.847
10 GACTGAATATTTAAACAATTATGTG ------
11 TAAATAGAAGGAGGCGCACAAAATG -0.110

...

376 ACCGTTATGGAGGGATACATAAATG -0.925
377 TAACTGAATTTAAAGGAGGTTCATG -0.325
378 CGTTTAAAATGCATAATAAGGAGTG -2.061
379 TGGAAAACAGGGAGAGATCATAATG -1.315
380 TGCAAACTAACGGGGGGATAATATG -1.710
381 AAAAGAAAAAGGAGATGGGAGTATG -0.511
382 ATACAAGGTCTTTCGGGAGGCCTTG -1.159
383 ATCATGATCGGTCATATTTTAGATG ------
384 CGAATGTAAACATGTAGCAAGGGTG -2.171
385 AAATTCGGGAGAGTGAAGCGAGATG -1.570
386 ACAAACAGAATTCAGGTGAGACATG -1.704


"(b
1
b
L
)= W(b
k
,k)
k=1
L
#
Automatically derived positional preference of the Shine-Dalgarno box in B.subtilis
-0,30
-0,25
-0,20
-0,15
-0,10
-0,05
0,00
-
2
2
-
2
1
-
2
0
-
1
9
-
1
8
-
1
7
-
1
6
-
1
5
-
1
4
-
1
3
-
1
2
-
1
1
-
1
0
-
9
-
8
-
7
Window position relative to the start codon
P
o
s
i
t
i
o
n

w
e
i
g
h
t
Incorporation of SD positional preference !
Positional weights
M - a possible position of the SD box within the RBS region
N(M) the number of time this position occurs
N
max -
the count of the most frequent position.
The strength of the RBS signal

V(M)=log
N(M)+0.5
N
max
+0.5
"
#
$
$
%
&
'
'


"(b
1
b
L
)= W(b
k
,k)
k=1
L
#
+V(M)
Outline of the algorithm!
Step I Step II Step III Step IV
High-stringency
DNA-protein search
(DPS, Huang, 1996)
DPS local
alignments
Contig sequence
Extend local alignments
Upstream: first start codon
Downstream: stop codon
Codon usage
Similarity-based
seed ORFs
Contig sequence
DPS alignments
Extend local alignments
Upstream: first start codon
Downstream: stop codon
Similarity-based and
ab initio ORFs with a
single possible start
Find other ORFs
with high coding potential
Alignment of -20-1 regions
upstream from the start codon
RBS detection
RBS weight matrix
Contig sequence
DPS alignments
Codon usage
Extend local alignments
Upstream: first start codon
Downstream: stop codon
Contig sequence
DPS alignments
Codon usage
RBS weight matrix
Find other ORFs
with high coding potential
Full set of
predicted genes
Extend all ORFs upstream
till the best start is found
Start codon selection procedure!
B.subtiliscompetence gene comA (Weinrauch et al., 1989)
Start codon position
in the genome
RBS signal
strength ()
Coding potential
quality ()
Putative translation initiation region
3252280 -4.171 2.090 gaatcctcattgtaaaattatcGTG
3252331 -3.423 1.031 tctaggcggcgaggtcaatgggATG
3252364 -3.684 0.927 ctcgtcatatgatctcattttaATG
3252445 -2.388 0.484 aattttggaaacggattcgaatTTG
3252463 -3.738 0.312 catggaaggcaccaagacaattTTG
3252484 -3.950 0.777 gattgatgaccatccggctgtcATG
3252508 -2.706 0.780 ggaaaacatgaaaaagatactaGTG
3252523 -0.718 0.844 agtgagtaaaagggaggaaaacATG
3252544 -2.502 0.698 cttttttataaaatggaaaagaGTG

Fragment of the C.trachomatis genome with genes predicted by similarity (red)
and ab initio(blue) !
Comparison of the gene prediction results with the sets of sequences from the PIR-International and the
genome sequencing projects!
Percentage of correctly identified
genes
(true positives)
Percentage of correct starts for
correctly identified genes
Percentage of correctly predicted genes
with correct starts using "leftmost
ATG" procedure
Dataset
Length>100 Length > 35 Length>100 Length > 35 Length>100 Length > 35
SUBPIR 93.3% - 80.2% - 61% -
ECOPIR 96.3% - 65.8% - 68.5% -
SUBGEN 98.9% 88.9% 79.1% 69.9% 64.4% 58.9%
ECOGEN 99.1% 87.1% 70.0% 61.3% 78.2% 71.7%

Organism GC
content,%
Status Gene prediction methodInformation content of
the RBS alignment
Helicobacter pylori 39 C GeneMark/GeneSmith 4.35
Campylobacter jejuni 31 P - 3.67
Enterococcus faecalis 36 P - 3.6
Bacillus subtilis 43.5 C GeneMark/additional
searches for RBS signals
3.47
Streptococcus pyogenes 36 P - 3.33
Streptococcus pneumoniae 40 P - 3.24
Aquifex aeolicus 43.4 C Similarity/Critica 3.13
M. thermoautotrophicum 49.5 C GenomeBrowser 3.11
Haemophilus influenzae 38 C GeneMark 3.06
Methanococcus jannaschii 31.4 C GeneMark 3.00
Pyrococcus horikoshii OT3 42 C Simple ORF extraction
and similarity analysis
2.93
Chlamydia trachomatis 42 P - 2.84
Neisseria gonorrhoeae 52 P - 2.73
Pyrococcus furiosus 41 P - 2.73
Archaeoglobus fulgidus 48.5 C GeneSmith/Critica 2.72
Escherichia coli 50.8 C Manual analysis 2.55
Neisseria meningitidis 52 P - 2.62
A. actinomycetemcomitans 44 P - 2.50
Treponema pallidum 52.8 C Glimmer 2.32
Synechocystis sp. 48 C ??? 2.26
Deinococcus radiodurans 66 P - 2.2
Yersinia pestis 48 P - 2.18
Borrelia burgdorferi 28.6 C Glimmer 2.08
Pseudomonas aeruginosa 67 P - 2.08
Mycobacterium leprae 58 P - 2.03
Streptomyces coelicolor 72 P - 1.98
Mycobacterium tuberculosis 65.6 C TbParse 1.93
Dependence of the gene start prediction
accuracy (selected genes)!
40,0
50,0
60,0
70,0
80,0
90,0
100,0
1,5 2,0 2,5 3,0 3,5 4,0
Information content of the RBS alignment
C
o
r
r
e
c
t
l
y

p
r
e
d
i
c
t
e
d

g
e
n
e

s
t
a
r
t
s
,

%
M.tuberculosis
P.aeruginosa
Synechosystis sp.
N.gonorrhae
E.coli
H.influenzae
B.subtilis
S.pneumoniae
M.thermoautotrophicum
Comparison of the experimentally
determined and predicted RBS in C.jejuni!
RBS Spacer Start codon
ttAGGAt 5 TTG
aAAGGct 3 ATG
tAAGGAt 5 ATG
cAAGGAt 6 ATG
tAAGGAg 7 ATG
aAAGGAc 5 ATG
aAAGGAa 4 ATG
cgAGGAc 2 ATG
AAGGca 3 ATG
agAGGAg 6 GTG
aAAGGAa 6 ATG
aAAGGAg 5 ATG
aAAGGAg 5 ATG
ttAGGtg 6 ATG
tcAGGAg 7 ATG
tAAGGAa 6 ATG
cAAGGAg 5 GTG
aAAGGAt 5 ATG
ctAGcAc 5 ATG
aAAGGAg 5 ATG
aAAGGta 9 ATG
RBS Spacer Start codon
AAAGGA 6 ATG
ACATGA 13 TTG
AAAGAA 6 ATG
AAAGGA 7 ATG
AAAGGA 6 ATG
AAAGGT 5 ATG
AAAGGA 8 ATG
AAAGGA 6 ATG
AAAGGA 6 ATG
TTAGGA 7 ATG
AAAGGT 9 ATG
AAAGGA 5 ATG
AAAGGA 6 TTG
AAAGGA 6 ATG
AAAGAA 16 ATG
AAAAGA 10 GTG
AAAGAA 6 TTG
AAAGGA 7 ATG
AGAGGA 3 ATG
AAAGGA 6 ATG
CAAAGA 1 ATG
AAAGGA 6 ATG
CAAGGG 6 ATG
ATAAGA 6 TTG
AGAAGA 13 TTG
AAAGGA 6 ATG
TTAGGA 6 ATG
CAAGGA 7 ATG
GAAGGA 5 ATG
AAAGGA 2 ATG
AAGAGA 9 ATG
AAACAA 3 GTG
AAAAGA 13 ATG
GAAGGT 16 GTG
Orpheus
Wstenet al. (1998)
The C.jejuniRBS weight matrix!
Position in the
window
1 2 3 4 5 6
A 0 0 0 -0.626 -0-806 0
C -0.788 -0.916 -1.079 -0.871 -1.017 -1.017
G -0.840 -0.862 -1.022 0 0 -0.960
T -0.840 -0.811 -1.079 -0.871 -1.017 -0.854
Consensus A A A G G A
Dependence of the gene start prediction accuracy on the RBS quality
(complete genomes)!
40,0
50,0
60,0
70,0
80,0
90,0
100,0
1,5 2 2,5 3 3,5 4
Information content of the RBS alignment
C
o
r
r
e
c
t
l
y

p
r
e
d
i
c
t
e
d

g
e
n
e

s
t
a
r
t
s
,

%
B.subtilis
A.aeolicus
M.thermoautotropicum
H.influenzae
M.jannaschii
P. horikishii
C.trachomatis
A.fulgidis
E.coli
T.pallidum
Synechosystis sp.
B.burgdorferi
M.tuberculosis

You might also like