Unit2 2

BLOSUM Matrices
Henikoff and Henikoff devised these matrices in 1992.
The groundwork for the development of the new matrices was a study aimed at identifying conserved motifs within families of proteins.
This study led to the creation of the BLOCKS database, which uses the concept of a block to identify a family of proteins.
The idea of a block is derived from the more familiar notion of a motif, which usually refers to a conserved stretch of amino acids that confers a specific function or structure on a protein.
When these individual motifs from proteins in the same family can be aligned without introducing a gap, the result is a block.
With these protein blocks in hand, it was then possible to look for substitution patterns only in the most conserved regions of a protein, the regions that were least prone to change.
Two thousand blocks representing more than 500 groups of related proteins were examined and, based on the substitution patterns in those conserved blocks, blocks substitution matrices (or BLOSUM, for short) were generated.
Distinctions between the BLOSUM and PAM matrices:
BLOSUM matrices are calculated directly across varying evolutionary distances rather than extrapolated, which provides a more accurate view of substitution patterns (and, in turn, evolutionary forces) at those various distances.
Because the BLOSUM matrices are calculated directly and only from conserved regions, they are more sensitive in detecting structural or functional substitutions.
Because of this, the BLOSUM matrices perform demonstrably better than the PAM matrices for local similarity searches.
Each BLOSUM matrix is assigned a number (BLOSUMn), and that number represents the conservation level of the sequences that were used to derive that particular matrix.
BLOSUM62 means the matrix is calculated from sequences sharing no more than 62% identity.
Sequences with more than 62% identity are clustered and their contribution is weighted as a single sequence.
A lower value of n means the matrix was derived from more distantly related sequences.

BLOSUM62 Matrix
Pij is the probability that amino acid i is replaced by amino acid j.
qi, qj are the background probabilities of finding amino acids i and j in any protein sequence.
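The definitions of Pij, qi and qj can be tied together with a small sketch. It assumes the usual log-odds form, score = 2 * log2(Pij / (qi * qj)) rounded to the nearest integer (half-bit units); the function name and the frequencies below are made up for illustration and are not taken from the BLOCKS database.

```python
import math

def log_odds_score(p_ij, q_i, q_j, scale=2.0):
    """Log-odds score in units of 1/scale bits, rounded to an integer.

    p_ij     : observed probability of the aligned pair (i, j)
    q_i, q_j : background probabilities of amino acids i and j
    scale    : 2.0 gives half-bit units (an assumption of this sketch)
    """
    return round(scale * math.log2(p_ij / (q_i * q_j)))

# Hypothetical frequencies, for illustration only.
favoured = log_odds_score(0.02, 0.05, 0.05)     # pair seen more often than chance
disfavoured = log_odds_score(0.001, 0.05, 0.05) # pair seen less often than chance
print(favoured, disfavoured)
```

A positive score means the pair occurs more often than expected by chance, a negative score means it occurs less often.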
Selecting an appropriate Scoring Matrix
Matrix     Best Use                                               Similarity (%)
PAM40      Short alignments that are highly similar               70-90
PAM160     Detecting members of a protein family                  50-60
PAM250     Longer alignments of more divergent sequences          ~30
BLOSUM90   Short alignments that are highly similar               70-90
BLOSUM80   Detecting members of a protein family                  50-60
BLOSUM62   Most effective in finding all potential similarities   30-40
BLOSUM30   Longer alignments of more divergent sequences          <30
Equivalencies are useful in relating PAM and BLOSUM matrices:
PAM250 is equivalent to BLOSUM45

Gaps and Gap Penalties
Gaps are introduced into alignments to compensate for insertions and deletions between the sequences being studied.
Affine gap penalty: This is the most widely used. The equation is
G + Ln
G : Gap-opening penalty (for gap initiation): 11
L : Gap-extension penalty (for gap extension): 1
n : Length of the gap
Because the gap-opening penalty is larger than the gap-extension penalty, lengthening an existing gap is favoured over creating a new one.
Other gap penalty: non-affine, or linear, gap penalty.
No cost for opening a gap: a simple mismatch penalty is assessed for each position of the gap.
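The effect of the two penalty schemes can be checked with a few lines of Python. This is a minimal sketch using G = 11 and L = 1 from the slide; the function names and the per-position cost of the linear penalty are assumptions made for illustration.

```python
def affine_gap_penalty(n, gap_open=11, gap_extend=1):
    """Affine penalty G + L*n for one gap of length n (0 if there is no gap)."""
    return gap_open + gap_extend * n if n > 0 else 0

def linear_gap_penalty(n, per_position=1):
    """Linear penalty: the same cost at every gap position, no opening cost."""
    return per_position * n

# One gap of length 4 versus two separate gaps of length 2.
one_long_gap = affine_gap_penalty(4)          # 11 + 1*4 = 15
two_short_gaps = 2 * affine_gap_penalty(2)    # 2 * (11 + 1*2) = 26
print(one_long_gap, two_short_gaps)

# Under the linear scheme both arrangements cost the same.
print(linear_gap_penalty(4), 2 * linear_gap_penalty(2))
```

With the affine scheme the single long gap is cheaper, which is why extending an existing gap is preferred over opening a new one.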
BLAST
BLAST: Basic Local Alignment Search Tool
It is capable of detecting not only the best region of local alignment between a query and the target, but also whether there are other plausible alignments between them.
To find these regions of local alignment in a computationally efficient fashion, the method begins by seeding the search with a small subset of letters from the query sequence, known as the query word.
Karlin-Altschul Equation
$E = K m N e^{-\lambda S}$
K is a minor constant
m is the number of letters in the query
N is the total number of letters in the target database
λ is a constant used to normalize the raw score of the high-scoring segment pair
S is the score of the high-scoring segment pair
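To see how the terms interact, the equation can be evaluated directly. The sketch below assumes the usual form E = K * m * N * exp(-lambda * S); the function name is hypothetical, and the values chosen for K, lambda, m and N are illustrative placeholders rather than parameters reported by any particular BLAST run.

```python
import math

def expect_value(score, m, n, k, lam):
    """Karlin-Altschul expect value: E = K * m * N * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)

# Illustrative placeholder parameters (assumptions of this sketch).
K, LAMBDA = 0.1, 0.3
QUERY_LEN, DB_LEN = 250, 50_000_000

for s in (40, 80, 120):
    print(s, expect_value(s, QUERY_LEN, DB_LEN, K, LAMBDA))
```

E grows linearly with the query length and the database size, and falls exponentially as the score increases.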

For the BLAST glossary, see:
http://www.ncbi.nlm.nih.gov/books/NBK62051/
PSI-BLAST
Position-Specific Iterated BLAST
It constructs a Position-Specific Scoring Matrix (PSSM) based on a Multiple Sequence Alignment (MSA).
FASTA
Another method for local sequence alignment.
Maintained by Dr. William Pearson at the University of Virginia
http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml
FASTA was the first widely used program designed for database similarity searching.
FASTA (Pearson and Lipman 1988)
This is a combination of word search and the Smith-Waterman algorithm.
The query sequence is divided into small words of a certain size.
The initial comparison of the query sequence to the database is performed using these words.
If these words are located on the same diagonal in an array, the region surrounding the diagonal is analyzed further.
Search time is proportional only to the size of the database.

FASTA Algorithm
FASTA ktups are shorter than BLAST words (W): 1-2 for proteins and 4-6 for nucleic acids.
Lower ktups give a slower, more sensitive search.
Higher ktups give a faster search with fewer false positives.
The FASTA program uses hash tables. These tables speed up the process of word search.
Query Sequence    = TCTCTC
                    123456 (position number)
Database Sequence = TTCTCTC
                    1234567 (position number)
You choose to use word size = 4 for your table (total number of words in your table is 4^4 = 256).
Word (total of 256)   Position w/in query   Position w/in DB   Offset (Q minus DB)
TCTC                  1,3                   2,4                -1 or -3 or 1
CTCT                  2                     3                  -1
TTCT                  -                     1                  -
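The word table for this example can be reproduced with a short Python sketch. It uses a dictionary as the hash table and stores only the words that actually occur, rather than all 256 possible words; the function names are hypothetical and this is not the FASTA program's own implementation.

```python
from collections import defaultdict

def word_positions(seq, k):
    """Map each k-letter word to its 1-based start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return table

def word_offsets(query, db, k):
    """Offsets (query position minus database position) for shared words."""
    q_table = word_positions(query, k)
    offsets = {}
    for word, db_pos in word_positions(db, k).items():
        if word in q_table:
            offsets[word] = sorted({q - d for q in q_table[word] for d in db_pos})
    return offsets

offsets = word_offsets("TCTCTC", "TTCTCTC", 4)
print(offsets)
```

TTCT occurs only in the database sequence, so it has no offset. The offset -1 is shared by TCTC and CTCT, which marks the diagonal on which the two sequences match.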
FASTA Steps
(Figure with four panels)
1. Local regions of identity are found. Matches with identical offset values lie in a contiguous sequence on one diagonal; other matches have different offset values.
2. Rescore the local regions using a PAM or BLOSUM matrix. Diagonals are extended.
3. Eliminate short diagonals below a cutoff score.
4. Create a gapped alignment in a narrow segment and then perform S-W alignment.
Summary of FASTA steps
1. Analyze the database for identical matches that are contiguous (between 5 and 10 amino acids in length, with the same offset values).
2. The longest diagonals are scored again using the PAM matrix (or another matrix). The best scores are saved as init1 scores.
3. Short diagonals are removed.
4. Long diagonals that are neighbors are joined. The score for this joined region is initn. This score may be lower due to a penalty for a gap.
5. An S-W dynamic programming alignment is performed around the joined sequences to give an opt score.
Thus, the time-consuming S-W step is performed only on the top-scoring sequences.
The ktup value
The ktup (for k-tuples) value stands for the length of the word used to search for identity.
For proteins, a ktup value of 3 would give a hash table of 20^3 elements (8000 entries).
The higher the ktup value, the less likely you will get a match unless it is identical (remember the dot plots).
The lower the ktup value, the more background you will have.
The higher the ktup value, the faster the analysis (fewer diagonals).
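The table sizes follow from raising the alphabet size to the power of ktup. A minimal sketch, assuming a 20-letter protein alphabet and a 4-letter nucleotide alphabet (the function name is hypothetical):

```python
def hash_table_size(alphabet_size, ktup):
    """Number of distinct words of length ktup over the given alphabet."""
    return alphabet_size ** ktup

for ktup in (1, 2, 3):
    print("protein", ktup, hash_table_size(20, ktup))

for ktup in (4, 6):
    print("nucleotide", ktup, hash_table_size(4, ktup))
```

The number of possible words grows quickly with ktup, so a chance match on a longer word is much less likely, which is why higher ktup values produce fewer diagonals.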
FASTA Versions
FASTA: nucleotide or protein sequence searching
FASTx/FASTy: compares a translated DNA query sequence to a protein sequence database (forward or backward translation of the query)
tFASTx/tFASTy: compares a protein query sequence to a DNA sequence database that has been translated into three forward and three reverse reading frames
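The idea of three forward and three reverse reading frames can be illustrated with a short sketch. It assumes the standard genetic code and uppercase A/C/G/T input only; the function names are hypothetical, and the FASTA programs themselves also handle frameshifts, which this sketch does not attempt.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons from the start of dna; leftover bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Return the translations for frames +1, +2, +3, -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for shift in range(3):
            frames[f"{strand}{shift + 1}"] = translate(seq[shift:])
    return frames

print(six_frame_translations("ATGGCC"))
```

Each of the six translated strings can then be compared with the protein sequence in the same way as an ordinary protein-protein search.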
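FASTA Statistical Significance
Z score for a single alignment:
$Z = \frac{\text{similarity score} - \text{mean score from database}}{\text{standard deviation from database}}$
$\text{Stand. Dev.} = \sqrt{\frac{\sum \text{scores}^2 - \frac{\left(\sum \text{scores}\right)^2}{\text{Total\#ofSequences}}}{\text{Total\#ofSequences}}}$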
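The z-score formula on this slide can be worked through in a few lines. The sketch below follows only that formula, with the standard deviation taken over all database scores; the function name and the score list are made up for illustration, and any further corrections the FASTA program may apply are not modelled here.

```python
import math

def z_score(similarity_score, database_scores):
    """(score - mean) / standard deviation, using the sums-of-squares form."""
    n = len(database_scores)
    total = sum(database_scores)
    total_sq = sum(s * s for s in database_scores)
    mean = total / n
    std_dev = math.sqrt((total_sq - total * total / n) / n)
    return (similarity_score - mean) / std_dev

# Hypothetical database scores: mean 5, standard deviation 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(9, scores))   # 2 standard deviations above the mean
print(z_score(25, scores))  # far above the bulk of the database
```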
FASTA Statistics
Using the distribution of the z-scores in the database, the FASTA program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search.
This is reported as the E() or expect value.
This value is the number of sequences you would expect to find with this score by searching a database of random sequences.
Thus, as the z-score increases, the E() value decreases.

Evaluating the Results of FASTA
Best
SCORES Init1: 2847 Initn: 2847 Opt: 2847
z-score: 2609.2 E(): 1.4e-138
Smith-Waterman score: 2847; 100.0% identity in 413 overlap
Good
z-score: 734.0 E(): 3.8e-34
Mediocre
z-score: 243.2 E(): 8.3e-07
When to use the correct program
Problem: Identify unknown protein
Program: BLASTP; FASTA3
Explanation: General protein comparison. Use ktup=2 for speed; ktup=1 for sensitive search.
Program: Smith-Waterman
Explanation: Slower than FASTA3 and BLAST but provides maximum sensitivity.
Program: TFASTX3; TFASTY3; TBLASTN
Explanation: Use if homolog cannot be found in protein databases; approx. 33% slower.
Program: Psi-BLAST
Explanation: Finds distantly related sequences. It replaces the query sequence with a position-specific score matrix after an initial BLASTP search. Then it uses this matrix to find distantly related sequences.
When to use the correct program (contd.)
Problem: Identify new orthologs in closely related species
Program: TFASTX3; TFASTY3; TBLASTN; TBLASTX
Explanation: Use PAM matrix <=20 or BLOSUM90 to avoid detecting distant relationships. Search EST sequences within the same species.
Problem: Identify EST sequence
Program: FASTX3; FASTY3; BLASTX; TBLASTX
Explanation: Always attempt to translate your sequence into protein prior to searching.
Problem: Identify DNA sequence
Program: FASTA; BLASTN
Explanation: Nucleotide sequence comparison.
TBLASTX: nucleotide query, translated nucleotide DB
BLASTX: nucleotide query, protein DB
