
Mining Sequence Patterns in Biological Data
Bioinformatics

Applies computer technology to molecular biology
Develops algorithms and methods to manage and analyze biological data
Effective methods are needed to compare and align biological sequences and to discover sequential patterns

Type of data

DNA: helix-shaped molecule whose constituents are two complementary strands of nucleotides: Adenine (A), Cytosine (C), Guanine (G), Thymine (T)

Proteins: chains built from 20 kinds of amino acids
Produced from DNA through 3 operations or transformations: transcription, splicing and translation

Gene: sequence of hundreds of individual nucleotides arranged in a particular order
Genome: complete set of genes of an organism

Alignment of Biological Sequences

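Alignment: given two or more input biological sequences, identify similar sequences with long conserved subsequences
Pairwise sequence alignment: two sequences
Multiple sequence alignment: more than two sequences
In nucleotide sequences, two symbols align if they are identical
In amino acid sequences, two symbols align if they are identical or if one can be derived from the other by a likely substitution
Local alignment vs. global alignment: align only portions of the sequences vs. the entire sequences
A substitution matrix represents the probabilities of substitutions
An alignment score can be calculated from the substitution scores and gap penalties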
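To make the idea of an alignment score concrete, here is a minimal sketch that scores two already-aligned rows. The scoring values (match = 1, mismatch = -1, gap = -2) and the function name `alignment_score` are assumptions chosen for illustration; a real scoring scheme would take per-pair values from a substitution matrix.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two equal-length aligned rows; '-' marks a gap.

    The match/mismatch/gap values are illustrative assumptions,
    standing in for a substitution matrix and a gap penalty.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap          # a symbol aligned against a gap
        elif a == b:
            score += match        # identical symbols
        else:
            score += mismatch     # substitution
    return score


# 5 matches, 1 mismatch, 1 gap: 5 - 1 - 2 = 2
print(alignment_score("GATTACA", "GA-TGCA"))
```

The alignment with the highest score under the chosen scheme is taken as the best alignment.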

Need for alignment

Two sequences are homologous if they share a common ancestor
The degree of similarity helps to determine the degree of homology
Helps to construct an evolution tree (phylogenetic tree)

Pairwise Alignment

Needleman-Wunsch algorithm (global alignment)
Smith-Waterman algorithm (local alignment)
Build up optimal alignments from optimal alignments of smaller subsequences
Use dynamic programming
O(n^2) time complexity
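The two dynamic programming algorithms differ mainly in how the table is initialized and whether scores may drop below zero. The sketch below computes only the optimal score (no traceback), assuming a simple linear scoring scheme (match = 1, mismatch = -1, gap = -1) in place of a substitution matrix; function names are illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    f = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per symbol.
    for i in range(1, rows):
        f[i][0] = i * gap
    for j in range(1, cols):
        f[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = f[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            f[i][j] = max(diag, f[i - 1][j] + gap, f[i][j - 1] + gap)
    return f[-1][-1]


def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal local alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets a local alignment start anywhere.
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best


print(needleman_wunsch("GCATGCG", "GATTACA"))    # 0
print(smith_waterman("TTTACGTTT", "GGGACGGGG"))  # 3, from the shared "ACG"
```

Both functions fill a table with (len(a) + 1) x (len(b) + 1) cells, which is where the quadratic running time comes from.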

Dot matrix plot

Uses Boolean matrices to represent alignments that can be detected visually
O(n^2) time complexity
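A dot matrix is simple to build: cell (i, j) is true when symbol i of the first sequence equals symbol j of the second. The sketch below is a minimal illustration; the names `dot_matrix` and `render` are assumptions, and practical tools usually filter the plot with a sliding window to reduce noise.

```python
def dot_matrix(a, b):
    """Boolean matrix: cell [i][j] is True when a[i] == b[j]."""
    return [[x == y for y in b] for x in a]


def render(matrix):
    """Draw the matrix as text, '*' for a match and '.' otherwise."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


print(render(dot_matrix("GATTACA", "GATTACA")))
```

Runs of matches along a diagonal show up as visible lines and correspond to aligned regions.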

Heuristic Algorithms

BLAST: Basic Local Alignment Search Tool
FASTA: fast alignment tool (the name is short for FAST-All)
Both first locate high-scoring short stretches and then extend them

BLAST Local Alignment Algorithm

Finds regions of local similarity between biosequences
Matches nucleotide or protein sequences against sequence databases and calculates the statistical significance of matches
Breaks the sequences to be compared into fragments (words) and seeks matches between words
DNA: word size of 11 bases
Proteins: word size of 3 amino acids
Creates a hash table of matching words
Moves from exact matches to neighborhood words
Due to hashing, runs in roughly O(n) time

Variants: MEGABLAST (long alignments), Discontiguous MEGABLAST (gapped alignments of similar but not identical sequences), BLASTN (adjustable word size), BLASTP (protein sequences)
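The word-matching idea can be sketched in a few lines: index every word of one sequence in a hash table, look up each word of the query, then extend each hit. This is a deliberately simplified illustration and not BLAST itself: it uses exact word matches only (no neighborhood words), extends only while symbols are identical (BLAST extends by score), and computes no statistical significance. All function names are assumptions.

```python
from collections import defaultdict


def index_words(seq, w):
    """Hash table mapping each length-w word to its start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - w + 1):
        table[seq[i:i + w]].append(i)
    return table


def find_seeds(query, subject, w=3):
    """(query_pos, subject_pos) pairs where a length-w word matches exactly."""
    table = index_words(subject, w)
    seeds = []
    for q in range(len(query) - w + 1):
        for s in table.get(query[q:q + w], []):
            seeds.append((q, s))
    return seeds


def extend_seed(query, subject, q, s, w):
    """Ungapped extension of a seed in both directions while symbols match.

    Returns (query_start, subject_start, length) of the extended hit.
    """
    left = 0
    while (q - left - 1 >= 0 and s - left - 1 >= 0
           and query[q - left - 1] == subject[s - left - 1]):
        left += 1
    right = 0
    while (q + w + right < len(query) and s + w + right < len(subject)
           and query[q + w + right] == subject[s + w + right]):
        right += 1
    return q - left, s - left, w + left + right


seeds = find_seeds("ACGT", "TTACGTT", w=3)
print(seeds)                                       # [(0, 2), (1, 3)]
print(extend_seed("ACGT", "TTACGTT", 0, 2, 3))     # (0, 2, 4)
```

Each hash lookup takes constant expected time, so seeding is linear in the sequence lengths, which is the source of the speed advantage over full dynamic programming.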

Multiple Sequence Alignment Methods

Goal: to find common patterns among all considered sequences
More complex than pairwise alignment
Applications
To build gene / protein families
To identify amino acids that are essential sites for structure and function
Multi-dimensional alignment / approximate alignment

Methods

Series of pairwise alignments
Feng-Doolittle alignment
Computes all possible pairwise alignments by dynamic programming
Constructs a guide tree by clustering, then performs progressive alignment along the tree

Multiple Sequence alignment

Hidden Markov Models
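The guide-tree step of the progressive approach can be illustrated with a small clustering sketch. It assumes pairwise distances have already been derived from the pairwise alignment scores, and it uses plain average-linkage clustering as a stand-in, which is not necessarily the clustering procedure of the Feng-Doolittle method. The function name and the distance values are made up for illustration.

```python
from itertools import combinations


def guide_tree(names, dist):
    """Greedy average-linkage clustering.

    dist maps frozenset({a, b}) to a pairwise distance.
    Returns the merge order as nested tuples.
    """
    # Each cluster is (subtree, list of member sequence names).
    clusters = [(name, [name]) for name in names]

    def avg(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the two closest clusters and merge them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Illustrative distances only: A and B are the closest pair.
d = {
    frozenset({"A", "B"}): 0.1,
    frozenset({"A", "C"}): 0.5,
    frozenset({"B", "C"}): 0.6,
}
print(guide_tree(["A", "B", "C"], d))   # ('C', ('A', 'B'))
```

Progressive alignment would then align the sequences in the order the tree joins them, starting with the most similar pair.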

HMM for Biological Sequence Analysis

Finding CpG Islands

Methylation tends to convert the C in a CpG dinucleotide to T
As a result, CpG occurs rarely
Methylation is suppressed around the start regions of genes
Areas with a high concentration of CpG are called CpG islands
Given a short sequence, is it from a CpG island?
Given a long sequence, can all CpG islands be found?

Markov Chain

Probability of a symbol depends only on the previous symbol
Markov chain model: states and transitions, each with a transition probability
Probability of a sequence x = x1 x2 ... xL: P(x) = P(x1) P(x2|x1) ... P(xL|xL-1)
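Under a first-order Markov chain, the probability of a sequence is the probability of its first symbol times the product of the transition probabilities along it. A short sequence can then be classified by comparing its probability under two chains (a log-odds score). The sketch below shows this; the transition tables are toy values invented for illustration, not estimates from real data, and all names are assumptions.

```python
import math

BASES = "ACGT"

# Toy background chain: every transition equally likely.
background = {p: {c: 0.25 for c in BASES} for p in BASES}

# Toy "island" chain: same as background except C -> G is boosted.
island = {p: {c: 0.25 for c in BASES} for p in BASES}
island["C"] = {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2}

uniform_start = {c: 0.25 for c in BASES}


def log_prob(seq, start, trans):
    """log P(x) = log P(x1) + sum of log P(x_i | x_{i-1})."""
    lp = math.log(start[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(trans[prev][cur])
    return lp


def log_odds(seq):
    """Positive values favor the island chain over the background chain."""
    return (log_prob(seq, uniform_start, island)
            - log_prob(seq, uniform_start, background))


print(log_odds("CGCG"))   # positive: two C -> G transitions
print(log_odds("ATAT"))   # 0.0: the two toy chains agree on these transitions
```

Working in log space avoids numerical underflow when the sequence is long.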

Hidden Markov Model

Used to find all CpG islands in a long DNA sequence
Merge two Markov chains (one for CpG islands, one for the rest of the sequence) and add transition probabilities between their states
Hidden Markov Model: states, transitions, emission probabilities (probability of producing a symbol at a state)
Hidden because the states visited in generating a sequence are not known

Hidden Markov Models

Tasks

Evaluation: given a sequence x, determine its probability P(x) (Forward algorithm)
Decoding: given a sequence, determine the most probable path through the model (Viterbi algorithm)
Learning: given a model and training sequences, find the model parameters (Baum-Welch algorithm)
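The evaluation and decoding tasks can be sketched on a toy two-state model, with state "+" standing for a CpG island and "-" for background. All probabilities below are invented for illustration, and the model is simplified to one state per region rather than one state per nucleotide per region. Learning (Baum-Welch) is not shown.

```python
STATES = "+-"
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1},
         "-": {"+": 0.1, "-": 0.9}}
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def forward(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Evaluation: P(obs), summed over all state paths."""
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        f = {s: emit[s][symbol] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    return sum(f.values())


def viterbi(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Decoding: most probable state path and its joint probability."""
    v = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []                       # back[t][s] = best predecessor of s
    for symbol in obs[1:]:
        ptr, nxt = {}, {}
        for s in states:
            best = max(states, key=lambda r: v[r] * trans[r][s])
            ptr[s] = best
            nxt[s] = v[best] * trans[best][s] * emit[s][symbol]
        back.append(ptr)
        v = nxt
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):      # follow back-pointers to the start
        path.append(ptr[path[-1]])
    path.reverse()
    return "".join(path), v[last]


print(viterbi("CGCGCG")[0])   # ++++++
print(viterbi("ATATAT")[0])   # ------
print(forward("CG"))
```

For long sequences these products underflow, so practical implementations work with log probabilities or rescale at each step.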
