
BLOSUM Matrices

Henikoff and Henikoff devised these matrices in 1992.

The groundwork for the development of the new matrices was a
study aimed at identifying conserved motifs within families of
proteins.

This study led to the creation of the BLOCKS database, which uses
the concept of a block to identify a family of proteins.

The idea of a block is derived from the more familiar notion of a motif,
which usually refers to a conserved stretch of amino acids that
confers a specific function or structure on a protein.

When these individual motifs from proteins in the same family can
be aligned without introducing a gap, the result is a block.

With these protein blocks in hand, it was then possible to look for
substitution patterns only in the most conserved regions of a protein,
the regions that were least prone to change.

Two thousand blocks representing more than 500 groups of related
proteins were examined and, based on the substitution patterns in
those conserved blocks, blocks substitution matrices (or BLOSUM,
for short) were generated.

The distinction between the BLOSUM and PAM matrices is that:

BLOSUM matrices are calculated directly across varying evolutionary
distances rather than extrapolated, which provides a more accurate
view of substitution patterns (and, in turn, evolutionary forces) at
those various distances.

The fact that the BLOSUM matrices are calculated directly, based
only on conserved regions, makes these matrices more sensitive
in detecting structural or functional substitutions.

Because of this, the BLOSUM matrices perform demonstrably
better than the PAM matrices for local similarity searches.

Each BLOSUM matrix is assigned a number (BLOSUMn), and
that number represents the conservation level of the sequences
that were used to derive that particular matrix.

BLOSUM62 means the matrix is calculated from sequences sharing
no more than 62% identity.

Sequences with more than 62% identity are clustered and their
contribution is weighted to one.

A lower value of n corresponds to a matrix derived from more
distantly related sequences.


BLOSUM62 Matrix

Pij is the probability that amino acid i is replaced by amino acid j

qi, qj are the background probabilities of finding the amino acids i, j in any
protein sequence
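
Pij, qi and qj are typically combined into a log-odds score: a substitution seen more often than chance predicts scores positive, and one seen less often scores negative. The sketch below assumes the common half-bit scaling (2 x log2, rounded to the nearest integer) and treats Pij as the probability of the ordered pair (i, j). The function name and the example probabilities are hypothetical and are not values from the BLOCKS database.

```python
import math


def log_odds_score(p_ij: float, q_i: float, q_j: float) -> int:
    """Return a half-bit log-odds score for aligning residues i and j.

    p_ij     : observed probability of the pair (i, j) in the aligned blocks
    q_i, q_j : background probabilities of residues i and j
    The half-bit scaling (factor 2, log base 2) is an assumption of this sketch.
    """
    expected = q_i * q_j  # chance of the pair if the residues were independent
    return round(2 * math.log2(p_ij / expected))


# Hypothetical numbers: the pair is seen twice as often as chance predicts.
print(log_odds_score(0.02, 0.1, 0.1))    # positive score
# Hypothetical numbers: the pair is seen four times less often than chance.
print(log_odds_score(0.0025, 0.1, 0.1))  # negative score
```

A score of zero would mean the pair occurs exactly as often as expected by chance.
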
Selecting an appropriate Scoring Matrix
Matrix     Best use                                              Similarity (%)
PAM40      Short alignments that are highly similar              70-90
PAM160     Detecting members of a protein family                 50-60
PAM250     Longer alignments of more divergent sequences         ~30
BLOSUM90   Short alignments that are highly similar              70-90
BLOSUM80   Detecting members of a protein family                 50-60
BLOSUM62   Most effective in finding all potential similarities  30-40
BLOSUM30   Longer alignments of more divergent sequences         <30

Equivalencies are useful in relating PAM and BLOSUM matrices:

PAM250 is equivalent to BLOSUM45
PAM160 is equivalent to BLOSUM62
PAM120 is equivalent to BLOSUM80
GAP and GAP penalties

Gaps are introduced into alignments to compensate for
insertions and deletions between the sequences being studied.

Affine gap penalty: This is the most widely used. The penalty for a
gap of length n is given by the equation

G + Ln

G : Gap-opening penalty (for gap initiation), for example 11
L : Gap-extension penalty (for gap extension), for example 1
n : Length of the gap

Because the gap-opening penalty is larger than the gap-extension
penalty, lengthening an existing gap is favoured over creating a new
one.

Other gap penalty: Non-affine, or linear, gap penalty.

No cost for opening a gap: a simple mismatch penalty is
assessed for each position of the gap.
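
The two gap models can be compared with a short sketch. It uses the G + Ln form given above with G = 11 and L = 1; the function names and the per-position cost of the linear model are assumptions chosen only for illustration.

```python
def affine_gap_penalty(n: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Penalty G + L*n for a single gap of length n (0 if there is no gap)."""
    if n <= 0:
        return 0
    return gap_open + gap_extend * n


def linear_gap_penalty(n: int, per_position: int = 1) -> int:
    """Linear (non-affine) penalty: the same cost for every gap position."""
    return per_position * max(n, 0)


# One gap of length 4 versus two separate gaps of length 2.
one_long_gap = affine_gap_penalty(4)
two_short_gaps = affine_gap_penalty(2) + affine_gap_penalty(2)
print(one_long_gap, two_short_gaps)

# Under the linear model the two layouts cost the same.
print(linear_gap_penalty(4), linear_gap_penalty(2) + linear_gap_penalty(2))
```

With the affine model the single long gap is cheaper than the two short ones, which is why extending an existing gap is preferred; the linear model makes no such distinction.
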
BLAST
BLAST: Basic Local Alignment Search Tool

It is capable of detecting not only the best region of local
alignment between a query and the target, but also
whether there are other plausible alignments between
them.

To find these regions of local alignment in a computationally
efficient fashion, the method begins by seeding the search
with a small subset of letters from the query sequence, known
as the query word.
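
As a minimal sketch of the seeding idea, the snippet below lists every overlapping word of a fixed length in a query. The default word size of 3 and the example sequence are assumptions for illustration; the sketch only enumerates the words and does not cover how the search scores or extends them.

```python
def query_words(query: str, word_size: int = 3) -> list[tuple[int, str]]:
    """Return (1-based start position, word) for every overlapping word."""
    return [
        (start + 1, query[start:start + word_size])
        for start in range(len(query) - word_size + 1)
    ]


# Hypothetical six-residue protein query.
for position, word in query_words("MKVLAT"):
    print(position, word)
```
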
Karlin-Altschul Equation

$E = K m N e^{-\lambda S}$

K is a minor constant

m is the number of letters in the query

N is the total number of letters in the target database

λ is a constant used to normalize the raw score of the
high-scoring segment pair

S is the score of the high-scoring segment pair
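
The equation can be evaluated directly. In the sketch below the values of K, λ, m and N are hypothetical and chosen only for illustration; real values depend on the scoring matrix, the gap penalties and the database being searched.

```python
import math


def expect_value(K: float, m: int, N: int, lam: float, S: float) -> float:
    """Karlin-Altschul expect value E = K * m * N * exp(-lam * S)."""
    return K * m * N * math.exp(-lam * S)


# Hypothetical parameters, for illustration only.
K, lam = 0.04, 0.27
m, N = 250, 50_000_000

for S in (50, 100, 150):
    print(S, expect_value(K, m, N, lam, S))
```

E falls exponentially as the score S rises, and grows in proportion to both the query length and the database size.
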


For the BLAST glossary, see:

http://www.ncbi.nlm.nih.gov/books/NBK62051/
PSI-BLAST

Position-Specific Iterated BLAST

It constructs a Position-Specific Scoring Matrix (PSSM) based on a
Multiple Sequence Alignment (MSA).
FASTA
Another method for local sequence alignment.

Maintained by Dr. William Pearson at the
University of Virginia

http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml

FASTA was the first widely used program
designed for database similarity searching
FASTA (Pearson and Lipman 1988)
This is a combination of a word search and the
Smith-Waterman algorithm

The query sequence is divided into small words of
a certain size.

The initial comparison of the query sequence to the
database is performed using these words.

If these words are located on the same diagonal in
an array, the region surrounding the diagonal is
analyzed further.

Search time is only proportional to the size of the database


FASTA Algorithm
FASTA ktups are shorter than BLAST words (W):
1-2 for proteins and 4-6 for nucleic acids.

Lower ktups give a slower, more sensitive search.

Higher ktups give a faster search with fewer false
positives.

The FASTA program uses hash tables. These tables speed
up the process of word search.

Query Sequence    = TCTCTC
                    123456   (position number)
Database Sequence = TTCTCTC
                    1234567  (position number)

You choose to use word size = 4 for your
table (the total number of possible words in your table is
4^4 = 256)

Word             Position w/in query   Position w/in DB   Offset (Q minus DB)
(total of 256)
TCTC             1, 3                  2, 4               -1 or -3 or 1
CTCT             2                     3                  -1
TTCT             -                     1                  -
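
The table above can be reproduced with a short sketch that stores word positions in a dictionary (Python's hash table) and subtracts database positions from query positions. The function names are hypothetical, and this illustrates only the lookup step, not the FASTA program itself.

```python
from collections import Counter, defaultdict


def word_positions(seq: str, k: int) -> dict[str, list[int]]:
    """Map each k-letter word to its 1-based start positions in seq."""
    table = defaultdict(list)
    for start in range(len(seq) - k + 1):
        table[seq[start:start + k]].append(start + 1)
    return dict(table)


def word_offsets(query: str, db: str, k: int) -> dict[str, list[int]]:
    """For each word found in both sequences, list all offsets (query - db)."""
    q_table = word_positions(query, k)
    d_table = word_positions(db, k)
    return {
        word: [q - d for q in q_pos for d in d_table[word]]
        for word, q_pos in q_table.items()
        if word in d_table
    }


offsets = word_offsets("TCTCTC", "TTCTCTC", k=4)
print(offsets)

# Count hits per offset: the most frequent offset marks the best diagonal.
diagonal_hits = Counter(o for values in offsets.values() for o in values)
print(diagonal_hits.most_common(1))
```

Words that share the same offset lie on the same diagonal of the comparison array, so counting offsets is a cheap way to find candidate regions before any scoring matrix is applied.
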
FASTA Steps
1. Local regions of identity are found (identical offset values in a
   contiguous sequence, as opposed to different offset values).

2. Diagonals are extended: rescore the local regions using a PAM or
   BLOSUM matrix.

3. Eliminate short diagonals below a cutoff score.

4. Create a gapped alignment in a narrow segment and then
   perform S-W alignment.
Summary of FASTA steps
1. Analyze the database for identical matches that are contiguous
(between 5 and 10 amino acids in length (same offset values)).

2. Longest diagonals are scored again using the PAM matrix (or
other matrix). The best scores are saved as init1 scores.

3. Short diagonals are removed.

4. Long diagonals that are neighbors are joined. The score for
this joined region is initn. This score may be lower due to a
penalty for a gap.

5. A S-W dynamic programming alignment is performed around
the joined sequences to give an opt score.

Thus, the time-consuming S-W step is performed only on the
top-scoring sequences.
The ktup value
The ktup (for k-tuple) value stands for the length of the
word used to search for identity.

For proteins, a ktup value of 3 would give a hash table of
20^3 elements (8000 entries).

The higher the ktup value, the less likely you will get a
match unless it is identical (remember the dot plots).

The lower the ktup value, the more background you will
have.

The higher the ktup value, the faster the analysis (fewer
diagonals).
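
The table sizes quoted for different word lengths follow from raising the alphabet size to the power of ktup. A minimal sketch, with a hypothetical function name:

```python
def hash_table_size(alphabet_size: int, ktup: int) -> int:
    """Number of distinct words of length ktup over the given alphabet."""
    return alphabet_size ** ktup


print(hash_table_size(20, 3))  # proteins (20 amino acids), ktup = 3
print(hash_table_size(4, 4))   # nucleotides (4 bases), word size = 4

# The number of possible words grows quickly with ktup, so a chance match
# to any one particular word becomes less likely.
for ktup in (1, 2, 3):
    print(ktup, hash_table_size(20, ktup))
```
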
FASTA Versions
FASTA: nucleotide or protein sequence searching

FASTx/FASTy: compare a translated DNA query sequence
to a protein sequence database (forward
or backward translation of the query)

tFASTx/tFASTy: compare a protein query sequence to a
DNA sequence database that has been
translated into three forward and three
reverse reading frames
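
The "three forward and three reverse reading frames" can be made concrete with a small sketch that splits a DNA sequence into codons for each frame. The function names, frame labels and example sequence are hypothetical, and the sketch stops at the codon level: it does not translate codons into amino acids.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def six_reading_frames(dna: str) -> dict[str, list[str]]:
    """Split a DNA sequence into whole codons for all six reading frames."""
    frames = {}
    for label, strand in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for shift in range(3):
            # Drop trailing bases that do not form a complete codon.
            usable = len(strand) - (len(strand) - shift) % 3
            codons = [strand[i:i + 3] for i in range(shift, usable, 3)]
            frames[f"{label}{shift + 1}"] = codons
    return frames


for frame, codons in six_reading_frames("ATGCCCTAA").items():
    print(frame, codons)
```
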
FASTA Statistical Significance

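Z score for a single alignment:

$$Z = \frac{\text{similarity score} - \text{mean score from database}}{\text{standard deviation from database}}$$

$$\text{Stand. Dev.} = \sqrt{\frac{\sum \text{scores}^2 - \dfrac{\left(\sum \text{scores}\right)^2}{\text{Total \# of sequences}}}{\text{Total \# of sequences}}}$$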
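
The z-score calculation can be sketched in a few lines using the sum and sum-of-squares form of the standard deviation. The function name and the example scores are hypothetical, and the sketch does not attempt to reproduce any further corrections the FASTA program may apply to the scores.

```python
import math


def z_score(similarity_score: float, database_scores: list[float]) -> float:
    """Z = (score - database mean) / database standard deviation."""
    n = len(database_scores)
    total = sum(database_scores)
    total_sq = sum(s * s for s in database_scores)
    mean = total / n
    # Standard deviation from the sum and the sum of squares of all scores.
    std_dev = math.sqrt((total_sq - total * total / n) / n)
    return (similarity_score - mean) / std_dev


# Hypothetical scores of unrelated database sequences, and one strong hit.
background = [40, 42, 38, 41, 39, 45, 37, 43]
print(z_score(120, background))
```

A score far above the database mean, measured in standard deviations, gives a large z-score.
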
FASTA Statistics
Using the distribution of the z-scores in the database, the
FastA program can estimate the number of sequences that
would be expected to produce, purely by chance, a z-score
greater than or equal to the z-score obtained in the search.

This is reported as the E() or expect value.

This value is the number of sequences you would expect to
find with this score by searching a database of random
sequences.

Thus, when z increases, the E() value decreases.


Evaluating the Results of FASTA
Best
SCORES Init1: 2847 Initn: 2847 Opt: 2847
z-score: 2609.2 E(): 1.4e-138
Smith-Waterman score: 2847; 100.0% identity in 413 overlap

Good
SCORES Init1: 719 Initn: 748 Opt: 793
z-score: 734.0 E(): 3.8e-34
Smith-Waterman score: 796; 41.3% identity in 378 overlap

Mediocre
SCORES Init1: 249 Initn: 304 Opt: 260
z-score: 243.2 E(): 8.3e-07
Smith-Waterman score: 270; 35.0% identity in 183 overlap
When to use the correct program
Problem           Program             Explanation

Identify          BLASTP; FASTA3      General protein comparison. Use ktup=2
unknown                               for speed; ktup=1 for sensitive search.
protein
                  Smith-Waterman      Slower than FASTA3 and BLAST but
                                      provides maximum sensitivity.

                  TFASTX3; TFASTY3;   Use if homolog cannot be found in
                  TBLASTN             protein databases; approx. 33% slower.

                  Psi-BLAST           Finds distantly related sequences. It
                                      replaces the query sequence with a
                                      position-specific score matrix after
                                      an initial BLASTP search. Then it uses
                                      this matrix to find distantly related
                                      sequences.
When to use the correct program (contd..)
Problem           Program             Explanation

Identify new      TFASTX3; TFASTY3;   Use PAM matrix <=20 or BLOSUM90 to
orthologs in      TBLASTN; TBLASTX    avoid detecting distant relationships.
closely related                       Search EST sequences w/in the same
species                               species.

Identify EST      FASTX3; FASTY3;     Always attempt to translate your
sequence          BLASTX; TBLASTX     sequence into protein prior to
                                      searching.

Identify DNA      FASTA; BLASTN       Nucleotide sequence comparison
sequence

TBLASTX: nucleotide query, translated nucleotide DB

BLASTX: nucleotide query, protein DB
