
Assignment 1: FASTA and BLAST

This assignment is to familiarize you with GenBank flatfile search and
retrieval, as well as interpretation of FASTA and BLAST results.
You need to choose a protein sequence to use in your
assignment. This year we are going to work on the
genome of the ciliate Tetrahymena thermophila.
All sequences must be chosen from this organism.
No two students should work with the same
sequence.

Nobel Prize 2009 for Tetrahymena

Figures courtesy of Wikipedia

Go to NCBI and search the Protein database for
Tetrahymena thermophila. This will link to 2164
pages with about 53000 gene records for this
organism.
Choose a page in the T. thermophila RefSeq database
at random and go to it. Everyone works on a
different gene.
Open some of the records to familiarize yourself with
their structure (GenBank flatfiles). Does the Gene
link display genomic position and intron/exon
structure?
Go back and, under Related Information on the right,
click on the Protein link (and several other links).
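
The Protein database search described above can also be written as a query URL for the NCBI E-utilities `esearch` service. The sketch below only builds the URL and sends no request. The helper name `protein_search_url` and the paging values are assumptions made for illustration, and the result count you see may differ from the numbers quoted above.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities esearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def protein_search_url(organism, retmax=20, retstart=0):
    """Build (but do not send) an esearch URL for one organism.

    retmax   = number of record IDs per "page"
    retstart = offset of the first record, so a random page can be picked
    """
    params = {
        "db": "protein",                  # search the Protein database
        "term": f"{organism}[Organism]",  # restrict to the organism field
        "retmax": retmax,
        "retstart": retstart,
    }
    return f"{ESEARCH}?{urlencode(params)}"


# Hypothetical example: third "page" of 20 records.
url = protein_search_url("Tetrahymena thermophila", retmax=20, retstart=40)
print(url)
```

Picking a different `retstart` for each student is one way to make sure everyone lands on a different part of the record list.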

Ideally you will choose a protein that does not have
large numbers of orthologues displaying high
identity with your choice.
To find out, from the Protein flatfile, click BLink
to check for precomputed orthologues. The
graphic gives you a sense of how many highly
similar proteins are present in the database.
Click on the Score for several of the alignments
(include a high and a low one). This lets you see
the distribution of identities and similarities along
the primary sequence.

Click on the Multiple Alignment button at the top.
This will align the various proteins from BLink with
each other.
In the graphic, using the control at the bottom,
examine the alignment. Note insertions and deletions
as well as the most conserved regions. What does
the colour coding represent?
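
To make "distribution of identities along the primary sequence" concrete, here is a minimal sketch that takes two already aligned sequences (gaps written as `-`) and reports percent identity and gap count per window. The function name, the window size and the two short sequences are invented for illustration; they are not taken from BLink output.

```python
def window_identity(aligned_a, aligned_b, window=10):
    """Return (start_column, percent_identity, gap_columns) per window.

    Both inputs must be rows of the same alignment, so they have equal
    length and use '-' for gap positions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aligned_a), window):
        columns = list(zip(aligned_a[start:start + window],
                           aligned_b[start:start + window]))
        # An identity is a column with the same residue in both rows.
        matches = sum(1 for x, y in columns if x == y and x != "-")
        # A gap column marks an insertion or deletion in one sequence.
        gaps = sum(1 for x, y in columns if x == "-" or y == "-")
        results.append((start + 1, 100.0 * matches / len(columns), gaps))
    return results


# Toy alignment (made up): the first half is fully conserved, the second
# half contains a two-residue deletion and one substitution.
row_a = "MKTAYIAKQRQISFVKSHFS"
row_b = "MKTAYIAKQR--SFVKAHFS"
profile = window_identity(row_a, row_b, window=10)
for start, identity, gaps in profile:
    print(f"columns from {start}: {identity:.0f}% identity, {gaps} gap columns")
```

Windows with high identity and no gaps correspond to the most conserved regions you are asked to note in the graphic.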

Save proof of your explorations as printouts
(to PDF, or screen captures) of the pages you
have visited for your protein. Append them to
your report.
Print the BLink page and some evidence of
exploration of the other links. (Sometimes it is
best to capture these as a screenshot, which
can be pasted into the data section of your
report PDF.)
The BLAST scores represent the high scores for
alignments of your protein against all other
proteins at NCBI. Let's look at how these
alignments are generated, first with FASTA and
then with BLAST.

Do a FASTA search (http://www2.ebi.ac.uk/fasta33/).
Paste in your sequence (FASTA format) and run with default
parameters. Save the output hit list to your hard drive.
Click on the FASTA Result or Visual FASTA button. This
will load the actual output of the sequence comparisons. Save
it to your hard drive.
Run the search against the entire UniProt Knowledgebase,
the human database alone, and the Archaea database
alone, using the default settings.
Download the outputs to your hard drive for use in your
report.
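
If your sequence is not already in FASTA format, the sketch below shows one way to produce it: a `>` header line followed by the residues wrapped to a fixed width. The helper name `to_fasta`, the identifier `example_protein` and the short sequence are placeholders for illustration only.

```python
def to_fasta(identifier, sequence, description="", width=60):
    """Format one sequence as a FASTA record.

    Whitespace inside the sequence is removed and residues are
    upper-cased before wrapping to `width` characters per line.
    """
    header = f">{identifier} {description}".rstrip()
    residues = "".join(sequence.split()).upper()
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([header] + lines) + "\n"


# Placeholder identifier and sequence, wrapped at 5 residues so the
# line breaks are visible in a short example.
record = to_fasta("example_protein", "mktay iakqr", width=5)
print(record, end="")
```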

The default setting is ktup 2. Run these searches again against
human and Archaea with ktup 1. Does it change the hit list, scores or
E-values (look especially at the numerically higher E-values)?
The default matrix is BLOSUM50. Set the matrix to BLOSUM80 and run
against Archaea (leaving other settings at default values). Save your
results. Does anything change?
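
To see why ktup matters, recall that the first stage of a FASTA search looks for exact words of length ktup shared by the query and a database sequence. The sketch below implements only that word lookup, not the later scoring and alignment stages, and the two short sequences are made up for illustration.

```python
from collections import defaultdict


def word_index(sequence, k):
    """Map every word of length k to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def shared_words(query, subject, k):
    """List (query_pos, subject_pos) pairs where the same k-word starts."""
    index = word_index(query, k)
    hits = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            hits.append((i, j))
    return hits


# Toy sequences (made up) differing at two positions.
query = "MKTAYIAK"
subject = "MRTAYLAK"
hits_k2 = shared_words(query, subject, 2)
hits_k1 = shared_words(query, subject, 1)
print("ktup 2:", len(hits_k2), "word hits")
print("ktup 1:", len(hits_k1), "word hits")
```

With ktup 1 every shared residue counts as a word hit, so more candidate regions reach the later stages. That is why a ktup 1 search is slower but more sensitive, and why weak hits near the bottom of the list are the ones most likely to change.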

Prepare a report (1 page max.) that clearly shows the effects of
varying these parameters. Cut and paste the data in, and draw
conclusions based on your data, which will be appended. All
statements must be supported by data!
The result will be unique and highly dependent on which protein is
used and whether it is strongly or weakly conserved through
evolution.

Run a BLASTP (NCBI, BLAST, Protein-Protein) search
against the nr (non-redundant) database.
Use the same protein as for the FASTA search.
Open Algorithm Parameters at the bottom of the page.
Explore ktup (Word Size) (2 or 3) and BLOSUM45 and BLOSUM80, and
run against the Archaea and human databases (default
parameters only). Compare your results.
Try a comparison of Gap Costs set to minimum and
maximum against the human database. What changes?
Put together a report (1 page max.) on BLAST.
Describe your findings, draw conclusions, and compare
the results to those found with FASTA.
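
When comparing gap cost settings, it helps to work out what each setting charges for gaps of different lengths. The sketch below assumes the affine convention in which a gap of length k costs existence + k × extension. The two (existence, extension) pairs are illustrative values only; use whichever minimum and maximum the BLAST form offers for your chosen matrix.

```python
def gap_penalty(length, existence, extension):
    """Affine gap cost, assuming cost = existence + extension * length."""
    if length <= 0:
        return 0
    return existence + extension * length


# Illustrative settings (assumed, not read from the BLAST form).
settings = [(11, 1), (9, 2)]
table = {
    (existence, extension): [gap_penalty(k, existence, extension)
                             for k in range(1, 6)]
    for existence, extension in settings
}
for (existence, extension), costs in table.items():
    print(f"existence {existence}, extension {extension}: "
          f"cost for gap lengths 1-5 = {costs}")
```

A low existence cost with a high extension cost favours many short gaps, while the reverse favours fewer but longer gaps. This is one thing to look for when you compare the alignments.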

The initial page of the BLAST result shows a search
of the conserved domains database. Click on the
graphic for the conserved domains.
What types are present in your protein?
What do they represent?
The results that you get are unique to your choice of
protein, the parameters and the database searched.
No two reports should be alike.
Look for changes in the scores and alignments,
particularly lower in the hit lists. Also note whether new
proteins are included or excluded as parameters change.
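
One way to track which proteins enter or leave a hit list is to reduce each run to a mapping from accession to E-value and compare the two mappings. This is a sketch under that assumption: the accessions and E-values below are placeholders, not real search results, and how you extract the mapping from your saved output is left open.

```python
def compare_hit_lists(run_a, run_b):
    """Compare two {accession: e_value} mappings from two searches.

    Returns (gained, lost, changed):
      gained  = accessions only in run_b
      lost    = accessions only in run_a
      changed = accessions in both whose E-value differs
    """
    gained = sorted(set(run_b) - set(run_a))
    lost = sorted(set(run_a) - set(run_b))
    changed = {
        acc: (run_a[acc], run_b[acc])
        for acc in sorted(set(run_a) & set(run_b))
        if run_a[acc] != run_b[acc]
    }
    return gained, lost, changed


# Placeholder data standing in for two runs with different parameters.
default_run = {"hit_1": 1e-50, "hit_2": 3e-12, "hit_3": 0.8}
modified_run = {"hit_1": 1e-50, "hit_2": 5e-10, "hit_4": 2.1}

gained, lost, changed = compare_hit_lists(default_run, modified_run)
print("gained:", gained)
print("lost:", lost)
print("changed:", changed)
```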
