Sequence Alignment
We need to have some sequences to work with in this lab. In the Biological Databases lab, we
learned how to look up sequence information in one of the many databases available online.
Look up the protein sequences of the genes of your choice and download them to your home
directory as FASTA files.
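Once the files are downloaded, it can help to check what is in them. The following is a minimal sketch of a FASTA reader, assuming the usual layout of a header line starting with '>' followed by one or more sequence lines. The function name read_fasta and the two example records are made up for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records standing in for a downloaded file.
example = StringIO(
    ">seq1 made-up example\n"
    "MKTAYIAK\n"
    "QRQISFVK\n"
    ">seq2 another made-up example\n"
    "MSLLTEVE\n"
)
records = list(read_fasta(example))
for header, seq in records:
    print(header, len(seq))
```

To read a real file, pass an open file object instead of the StringIO object, e.g. the result of open() on the FASTA file you saved in your home directory.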
Why not the nucleotide sequence? Many different triplet codons translate to the same
amino acid, and what ultimately determines the structure, and thus the function, of the protein is its
amino acid sequence. So most of the time, it makes more sense to align protein sequences
rather than nucleotide sequences. This is not a strict rule: there are definitely times
when you want to align nucleotide sequences, e.g. for non-protein-coding sequences or regulatory
regions of DNA.
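To make the codon argument concrete, here is a small sketch showing two different DNA sequences that translate to the same peptide. The two sequences are made up, and the codon table contains only the handful of standard-code codons this example needs, so it is not a general translator.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "ATG": "M",
    "CTT": "L", "TTA": "L",
    "TCT": "S", "AGC": "S",
    "GGT": "G", "GGC": "G",
    "TAA": "*",  # stop codon
}


def translate(dna):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Two made-up coding sequences that differ at several nucleotide positions.
dna_a = "ATGCTTTCTGGTTAA"
dna_b = "ATGTTAAGCGGCTAA"

protein_a = translate(dna_a)
protein_b = translate(dna_b)
nucleotide_matches = sum(1 for x, y in zip(dna_a, dna_b) if x == y)

print(protein_a, protein_b)
print(f"{nucleotide_matches} of {len(dna_a)} nucleotides identical")
```

The proteins come out identical even though a sizeable fraction of the nucleotides differ, which is why a protein alignment can reveal similarity that a nucleotide alignment would understate.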
When you enter a query sequence, BLAST essentially performs many local alignments for you against
the sequences in the database you choose and gives you the set of highest scoring alignments.
Obviously, it's not running thousands and thousands of full dynamic-programming alignments
(such as Needleman-Wunsch), or you would never get a result! Instead, BLAST uses a heuristic
approach: it first splits the sequence you enter into a set of short "words", scans the sequence
database for these words, and then uses the hits to seed alignments that extend outward from the
found words. You can learn a lot more about the actual workings of BLAST in this chapter of the NCBI
Handbook.
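The word-and-seed idea can be sketched in a few lines. This is a deliberately simplified illustration, not BLAST's actual algorithm: it uses exact word matches only and extends a seed while residues are identical, whereas real BLAST also considers similar "neighborhood" words and extends using substitution scores. The function names and the toy sequences are made up for the example.

```python
from collections import defaultdict


def index_words(seq, w):
    """Map each length-w word to the positions where it starts in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a query word occurs exactly."""
    subject_index = index_words(subject, w)
    seeds = []
    for q in range(len(query) - w + 1):
        for s in subject_index.get(query[q:q + w], []):
            seeds.append((q, s))
    return seeds


def extend_seed(query, subject, q, s, w=3):
    """Extend a seed left and right while the residues stay identical.

    Returns (query_start, subject_start, length) of the extended match.
    """
    left = 0
    while (q - left - 1 >= 0 and s - left - 1 >= 0
           and query[q - left - 1] == subject[s - left - 1]):
        left += 1
    right = w
    while (q + right < len(query) and s + right < len(subject)
           and query[q + right] == subject[s + right]):
        right += 1
    return q - left, s - left, left + right


# Toy sequences for illustration only.
query = "MKVLSDEFG"
subject = "AAKVLSDQFGAA"

seeds = find_seeds(query, subject)
extended = sorted({extend_seed(query, subject, q, s) for q, s in seeds})
print(seeds)
print(extended)
```

Several overlapping seeds collapse into the same extended match, which is the point of seeding: a cheap word lookup tells the program where a more careful alignment is worth attempting.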
Having a search tool available doesn't necessarily mean life becomes simple! Look
at all the choices you have before you even enter a query sequence! Luckily, the people at
NCBI have spent a lot of time writing documentation on BLAST, so check out this table to see
how the various BLAST programs differ. For now, just focus on the
ones for protein queries.
Since our goal here is to look for sequences similar to the query sequence and our query
sequence is most likely longer than 15 residues, protein blast (blastp) seems like a
decent choice. So select protein blast (blastp) from the table (or from the main BLAST
page).
You should now see a page with a form containing many boxes for you to fill in. The
main text box labeled 'Enter accession number, gi, or FASTA sequence' is where you
put in your query sequence. Not sure what kind of format the sequence should be in?
Click on the link next to 'Enter accession number, gi, or FASTA sequence' for an
explanation.
How do we tell BLAST where to look for similar sequences? That's specified by the
database we use for the query, chosen by the 'Database' dropdown box. To learn what
the different databases are, click on the link next to the 'Database' dropdown. Choosing
an appropriate database is an important step, since the results you get completely
depend on which database you choose to query. If you only care about finding similar
sequences in human, you may want to use the 'Organism' text box to restrict the search to
(or exclude) particular organisms rather than searching the entire database, which would
cover all sequences regardless of organism.
There are many other settings for blastp and I suggest clicking on the explanation for
each to find out what that setting is for. If you're really itching to do your first alignment,
just leave everything at its default value and click on the 'Blast' button.
Once you submit your request, you'll be directed to a page with your Request ID. Wait a
bit to let your request go through the queue. By default, BLASTP will also run a search
against the Conserved Domain Database (CDD) and display the results graphically while
it performs the Blast search. The graphic shows protein domains that may be present in
your query. Click on the graphic to get more information about the search results. Here's
a page with help on CDD.
When the server finishes processing your request, it'll show you the set of sequences it
found. Congrats! You just performed your first BLAST search! Below where the results
show your query and its length is a graphic titled "Distribution of BLAST Hits". This
graphic shows at a glance which matches are significant and where they align
with your query. You should be able to tell from this whether the matches are local or
global with respect to your query. The matched sequences are also referred to as "target"
sequences. Scroll down a bit and you'll see a big list of whole sequences or sequence
fragments that matched your query sequence.
o Associated with each sequence is a 'score', which is the pairwise alignment
score between that sequence and your query sequence.
o You'll also see something called an 'E-value', which is the 'expect value': the
number of alignments with at least this score that you would expect to find by chance
in a database of this size. It tells you how statistically significant your alignment is (more here).
The E-value is calculated from the alignment score, the length of the query sequence, and the
database size. The smaller the E-value, the more significant the match. A rule of
thumb is that matches with E-values less than 1e-3 are significant.
o Below the line showing the Expect value are three numbers identified as
"Identities", "Positives", and "Gaps". Identities gives the number of exact matches
between the query and the target sequence over the length of this alignment. A
general rule of thumb for percent identity is that it should be 25-30% over an alignment of at
least 80-100 amino acids to be able to assert that the sequences are
homologous, meaning that they are evolutionarily related. If sequences are
homologous, then you have a better case for asserting they share a common
function. "Positives" takes into account both exact matches and conservative
substitutions (i.e., similar amino acids). "Gaps" refers to the number of gaps in the
alignment. All else being equal, the fewer the gaps, the better the alignment.
o By default, BLAST will filter out low-complexity regions, which have biased amino acid
composition (e.g., runs of repeated amino acids), because they will
skew the results. In the output, BLAST will display these regions grayed out and
in lower case.
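The three numbers described above can be computed directly from the two rows of an alignment. The sketch below assumes gaps are written as '-' and approximates "similar amino acids" with a small hand-picked set of residue pairs that score positively in BLOSUM62. BLAST itself counts every position whose substitution score is positive, so the pair list here is an illustrative subset, and the aligned rows are made up.

```python
# Illustrative subset of residue pairs with a positive BLOSUM62 score.
CONSERVATIVE_PAIRS = {
    frozenset(pair) for pair in ("ST", "IL", "IV", "LV", "DE", "KR", "FY")
}


def alignment_stats(query_row, subject_row):
    """Count identities, positives and gaps over two aligned rows."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    identities = positives = gaps = 0
    for a, b in zip(query_row, subject_row):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            identities += 1
            positives += 1  # identical residues also count as positives
        elif frozenset((a, b)) in CONSERVATIVE_PAIRS:
            positives += 1
    return {
        "length": len(query_row),
        "identities": identities,
        "positives": positives,
        "gaps": gaps,
    }


# Made-up alignment rows for illustration.
stats = alignment_stats("MKV-LSDEF", "MRVALT-EY")
percent_identity = 100.0 * stats["identities"] / stats["length"]
print(stats)
print(f"{percent_identity:.1f}% identity")
```

Note that the percentages are taken over the length of the alignment, not over the full length of either sequence, which is why a high percent identity over a very short alignment says little about homology.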
Note that the alignment scores are based on the scoring matrix that BLAST used - for protein
sequence alignments, this is a very important setting and how you choose an appropriate matrix
depends on many things, including what kind of sequence similarities you want to detect, how
long your sequences are, etc. Here's a longer explanation of substitution matrices.
1. Graphic Display
2. Hit List
3. Alignment
Blast Exercise
The DCC gene, which is deleted in colorectal cancer, is located on human chromosome 18q21.3. It
encodes a tumor suppressor protein. Expression of the gene is significantly reduced in most
colorectal carcinomas. The protein sequence of human DCC has the RefSeq accession number
NP_005206.
Locate the GenBank record for this protein. Note the length of the amino acid sequence.
Perform a BLASTP search using this protein as the query and Swiss-Prot as the target
database. Limit the search to mammalian species only and use BLOSUM62 as the
scoring matrix.
o The DCC protein from human is most closely related to the DCC protein from
what other mammal?
o What percent identity do they share?
o What is their percent similarity?
o What is the length of the alignment? Were both proteins aligned along their entire
length?
o Does the DCC protein contain any low-complexity regions that have been
masked out by BLASTP? If so, where?
Look at the results for the protein with Swiss-Prot ID P97798.
o What percent identity does it share with the query?
o What is the alignment length? Is it a global or local alignment for the query and
the target?
Based on the BLASTP results, can any general observations be made regarding the
putative function or cellular role of DCC?
1. You have a query sequence that is 28 residues long. You BLAST this sequence against
a non-redundant protein database with 32576 sequences and a total length of
6887085 amino acids. The best hit is in a sequence that is 375 amino acids long. Looking
at the alignment of the best hit, you observe the following (only the high-scoring segment
pair is shown):
Query  23  NFSSSQGY  38
Sbjct  65  NFSTSQGV  82
2. Imagine you have sequenced a novel fungus genome. There are 6500 predicted
genes in this genome. You do a BLAST search among known fungal genomes
using the translation of predicted genes.
a) Of the 6500 proteins, 5500 have a unique match in other species of fungi (S.
cerevisiae, S. pombe, or N. crassa) with a very low E-value. What can you say
about these proteins?
b) The remaining 1000 proteins each have a best match with an E-value larger than
10. What can you say about these proteins?
c) For these last 1000 proteins, can you run another BLAST search to verify your
conclusion? What else can you try?
3. We do a BLAST search to predict the function of our human query protein on two different internet sites
that provide a BLAST search tool. The alignments of best hits are given above.
a) Which hit is statistically more significant? Explain.
b) What is the reason for the difference between the two BLAST results?
We do a BLAST search to predict its function. Top 10 hits are given above.
a) Can you predict the function of this protein based on this output?
b) What can you do to improve this search?
BLOSUM62 Matrix
[Table: BLOSUM62 amino acid substitution matrix]