
Multiple Sequence Alignment

CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)

Why do we care about sequence alignment?

It can tell us something about the evolution of organisms.
We can see which regions of a gene (or its derived protein) are susceptible to mutation and which can have one residue replaced by another without changing function.
Homologous genes (genes that share an evolutionary origin) have similar sequences.
Orthologs are genes that are evolutionarily related and have a similar function, but now appear in different species.
Paralogs are evolutionarily related (share an origin) but no longer have the same function.
You can uncover either orthologs or paralogs through sequence alignment.

Multiple Sequence Alignment


Often applied to proteins.
Proteins that are similar in sequence are often similar in structure and function.
Sequence changes more rapidly in evolution than do structure and function.

Overview of Methods

Dynamic programming: too computationally expensive to do a complete search; uses heuristics
Progressive: starts with pair-wise alignment of the most similar sequences; adds to that
Iterative: makes an initial alignment of groups of sequences; adds to these (e.g. genetic algorithms)
Locally conserved patterns
Statistical and probabilistic methods

Dynamic Programming
Computational complexity is even worse than for pair-wise alignment because we're finding all the paths through an n-dimensional hyperspace. (We can picture this in 2 or 3 dimensions.)
Can align about 7 relatively short (200-300 residues) protein sequences in a reasonable amount of time; not much beyond that.
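To get a feel for how quickly the search space grows, the sketch below counts the cells in the n-dimensional dynamic programming lattice. It assumes every sequence has the same length, and the function names are illustrative, not taken from the course materials.

```python
def lattice_cells(n_seqs: int, length: int) -> int:
    """Number of cells in the DP lattice for n_seqs sequences of equal length."""
    return (length + 1) ** n_seqs


def predecessors_per_cell(n_seqs: int) -> int:
    """An interior cell is reached from 2^n - 1 neighbors: one for every
    nonempty subset of sequences that advances by one residue."""
    return 2 ** n_seqs - 1


if __name__ == "__main__":
    # Compare pair-wise alignment (n = 2) with 3 and 7 sequences of length 200.
    for n in (2, 3, 7):
        print(n, lattice_cells(n, 200), predecessors_per_cell(n))
```

Going from 2 to 3 sequences of length 200 multiplies the number of cells by 201, and each cell has 7 neighbors to check instead of 3.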

A Heuristic for Reducing the Search Space in Dynamic Programming


Let's picture this in 3 dimensions (pp. 146-157 in book). It generalizes to n.
Consider the pair-wise alignments of each pair of sequences.
Create a phylogenetic tree from these scores.
Consider a multiple sequence alignment built from the phylogenetic tree.
These alignments circumscribe a space in which to search for a good (but not necessarily optimal) alignment of all n sequences.

Phylogenetic Tree
Dynamic programming uses a phylogenetic tree to build a first-cut MSA.
The tree shows how proteins could have evolved from shared origins over evolutionary time.
See page 143 in Bioinformatics by Mount. Chapter 6 goes into detail on this.

Dynamic Programming -- MSA

Create a phylogenetic tree based on pair-wise alignments. (Pairs of sequences that have the best scores are paired first in the tree.)
Do a first-cut MSA by incrementally doing pair-wise alignments in the order of alikeness of sequences as indicated by the tree. The most alike sequences are aligned first.
Use the pair-wise alignments and the first-cut MSA to circumscribe a space within which to do a full MSA that searches through this solution space.
The score for a given alignment of all the sequences is the sum of the scores for each pair, where each of the pair-wise scores is multiplied by a weight indicating how far the pair-wise score differs from the first-cut MSA alignment score.
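The weighted sum-of-pairs score described above can be sketched as follows. This is a minimal illustration: the match, mismatch, and gap values are placeholder assumptions (a real program would use a substitution matrix), and the `weights` dictionary is a hypothetical stand-in for the pair weights derived from the first-cut alignment.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score two equal-length aligned rows column by column."""
    score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a column with two gaps contributes nothing to this pair
        elif x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def weighted_sum_of_pairs(rows, weights=None):
    """Sum the pair scores over all pairs of rows.

    weights maps an index pair (i, j) with i < j to that pair's weight;
    pairs without an entry get weight 1.0 (an assumption of this sketch).
    """
    total = 0.0
    for i, j in combinations(range(len(rows)), 2):
        w = 1.0 if weights is None else weights.get((i, j), 1.0)
        total += w * pair_score(rows[i], rows[j])
    return total


if __name__ == "__main__":
    msa = ["AC-T", "ACGT", "A-GT"]
    print(weighted_sum_of_pairs(msa))
    print(weighted_sum_of_pairs(msa, {(0, 2): 0.5}))
```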

Heuristic Dynamic Programming Method for MSA


Does not guarantee an optimal alignment of all the sequences in the group. Does get an optimal alignment within the space chosen.

Progressive Methods

Similar to the dynamic programming method in that it uses the first step (i.e., it creates a phylogenetic tree, aligns the most-alike pair, and incrementally adds sequences to the alignment in order of alikeness as indicated by the tree).
Differs from the dynamic programming method for MSA in that it doesn't refine the first-cut MSA by doing a full search through the reduced search space. (This is the computationally expensive part of DP MSA in that, even though we've cut down the search space, it's still big when we have many sequences to align.)

Progressive Method

Generally proceeds as follows:
Choose a starting pair of sequences and align them.
Align each next sequence to those already aligned, one at a time.
Heuristic method: doesn't guarantee an optimal alignment.
Details vary in implementation:
How to choose the first sequence to align?
Align all subsequent sequences cumulatively or in subfamilies?
How to score?
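One simple way to choose the starting pair is to score every pair with a global pair-wise alignment and take the best. The sketch below assumes a Needleman-Wunsch score with a linear gap penalty and placeholder scoring values; real implementations differ in how they score and in how they pick the first pair.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    # prev holds the previous row of the DP table; row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_starting_pair(seqs):
    """Indices (i, j) of the pair with the highest pair-wise alignment score."""
    return max(
        combinations(range(len(seqs)), 2),
        key=lambda p: nw_score(seqs[p[0]], seqs[p[1]]),
    )


if __name__ == "__main__":
    seqs = ["HEAGAWGHEE", "PAWHEAE", "HEAGAWGHEQ"]
    print(best_starting_pair(seqs))
```

Only the score is computed here, not the alignment itself; recovering the alignment would require keeping the full table and tracing back through it.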

ClustalW

Based on phylogenetic analysis.
A phylogenetic tree is created using a pair-wise distance matrix and the neighbor-joining algorithm.
The most closely related pairs of sequences are aligned using dynamic programming.
Each of the alignments is analyzed and a profile of it is created.
Alignment profiles are aligned progressively for a total alignment.
The W in ClustalW refers to a weighting of scores depending on how far a sequence is from the root of the phylogenetic tree. (See p. 154 of Bioinformatics by Mount.)
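As an illustration of what a profile of an alignment can look like, the sketch below records the frequency of each residue (and of the gap character) in every column. It is a simplified assumption about the idea of a profile, not a description of ClustalW's internal data structures, and it applies no sequence weighting.

```python
from collections import Counter


def build_profile(rows):
    """Per-column residue frequencies for equal-length aligned rows.

    Gaps are written as '-' and are counted like any other symbol.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({res: n / len(rows) for res, n in counts.items()})
    return profile


if __name__ == "__main__":
    for col in build_profile(["ACGT", "ACGA", "A-GT"]):
        print(col)
```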

Problems with Progressive Method


Highly sensitive to the choice of initial pair to align. If they aren't very similar, it throws everything off.
It's not trivial to come up with a suitable scoring matrix or gap penalties.

Iterative Methods for Multiple Sequence Alignment


Get an alignment.
Refine it.
Repeat until one MSA doesn't change significantly from the next.
An example is the genetic algorithm approach.

Genetic Algorithms
A general problem solving method modeled on evolutionary change. Create a set of candidate solutions to your problem, and cause these solutions to evolve and become more and more fit over repeated generations. Use survival of the fittest, mutation, and crossover to guide evolution.

Evolutionary Change in Genetic Algorithms


Survival of the fittest: the best solutions survive and reproduce to the next generation.
Mutation: some solutions mutate in random ways (but they must always remain viable solutions).
Crossover: solutions exchange parts.

Laying Out the Problem


What would a candidate solution look like in a multiple sequence alignment program? (an msa of ~20 proteins) How many candidate solutions should there be? (~100)

Evolving to a Next Generation

Which candidate solutions should survive to the next generation?


First, take the top half based on the best sum-of-pairs scores.
Then randomly select the second half, giving each MSA a chance of being selected in proportion to how good its score is.
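The selection step can be sketched as below. Two details are assumptions of this sketch rather than anything the slides specify: the second half is drawn from the whole population with replacement, and scores are shifted so that the lowest-scoring candidate still gets a small positive weight.

```python
import random


def select_next_generation(population, scores, rng=None):
    """Keep the top half by score, then fill the rest by score-proportional draws."""
    rng = rng or random.Random()
    ranked = sorted(range(len(population)), key=lambda i: scores[i], reverse=True)
    half = len(population) // 2
    survivors = [population[i] for i in ranked[:half]]
    # Shift the scores so every weight is positive even when scores are negative.
    low = min(scores)
    weights = [s - low + 1e-9 for s in scores]
    survivors += rng.choices(population, weights=weights, k=len(population) - half)
    return survivors


if __name__ == "__main__":
    candidates = ["msa_a", "msa_b", "msa_c", "msa_d"]  # stand-ins for real MSAs
    print(select_next_generation(candidates, [4, 1, 3, 2], random.Random(0)))
```

Passing a seeded `random.Random` makes a run repeatable, which helps when comparing parameter settings.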

How would mutation work?


Can't change a sequence in the MSA. Otherwise you would be creating a solution that isn't really a solution.
You can only insert or rearrange gaps.
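One mutation that respects this constraint moves a single gap within one row. The sketch below is an illustrative assumption about how such an operator could look; it leaves the residues and their order untouched, so every row still spells its original sequence.

```python
import random


def mutate_shift_gap(rows, rng=None):
    """Move one gap in one randomly chosen row to a new position.

    The row keeps its length and its residues, so the alignment keeps its
    width and each sequence is unchanged once gaps are removed.
    """
    rng = rng or random.Random()
    rows = list(rows)
    r = rng.randrange(len(rows))
    gap_positions = [i for i, ch in enumerate(rows[r]) if ch == "-"]
    if not gap_positions:
        return rows  # this row has no gap to move
    chars = list(rows[r])
    del chars[rng.choice(gap_positions)]
    chars.insert(rng.randrange(len(chars) + 1), "-")
    rows[r] = "".join(chars)
    return rows


if __name__ == "__main__":
    print(mutate_shift_gap(["AC-GT", "A-CGT", "ACGT-"], random.Random(1)))
```

A fuller operator would probably also insert new gap columns and remove columns that end up containing only gaps; neither is done here.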

How would crossover work?

See page 160 in Bioinformatics by Mount.

Profiles and Motifs

A sequence motif is a relatively short pattern that appears consistently within a family of proteins. (Motifs can also appear in families of DNA or RNA molecules.) Frequently, motif-based analysis is used to detect patterns of amino acids in proteins that correspond to structural or functional features. Motifs are generated during multiple sequence alignment. They can be displayed as patterns of amino acids, as sequence logos, or as profile scoring matrices.
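To show how a motif can be written as a pattern of amino acids, the sketch below turns the columns of a small alignment into a PROSITE-like pattern string. It is a simplification under stated assumptions: a fully conserved column becomes that residue, a variable column becomes a bracketed set, and any column containing a gap is written as x (any residue).

```python
def column_pattern(rows):
    """Build a PROSITE-like pattern with one element per alignment column."""
    elements = []
    for column in zip(*rows):
        residues = sorted(set(column))
        if "-" in residues:
            elements.append("x")  # column contains a gap: allow any residue
        elif len(residues) == 1:
            elements.append(residues[0])  # fully conserved position
        else:
            elements.append("[" + "".join(residues) + "]")  # allowed alternatives
    return "-".join(elements)


if __name__ == "__main__":
    print(column_pattern(["CAHC", "CGHC"]))
    print(column_pattern(["CAHC", "CGHC", "C-HC"]))
```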
