This assignment is to familiarize you with GenBank
flatfile search and retrieval, as well as the interpretation of FASTA and BLAST results. You need to choose a protein sequence to use in your assignment. This year we are going to work on the genome of the ciliate Tetrahymena thermophila. All sequences must be chosen from this organism. No two students should work with the same sequence.
Nobel Prize 2009 for Tetrahymena
Figures courtesy of Wikipedia
Go to NCBI and search the Protein Database for
Tetrahymena thermophila. This will link to 2164 pages with about 53000 gene records for this organism. Choose a page in the T. thermophila RefSeq database at random and go to it. Everyone works on a different gene. Open some of the records to familiarize yourself with their structure (GenBank flatfiles). Does the Gene link display genomic position and intron/exon structure? Go back and, under Related Information on the right, click on the Protein link (and several other links).
Ideally you will choose a protein that does not have
large numbers of orthologues with high identity to it. To find out, from the Protein flatfile, click Blink to check for precomputed orthologues. The graphic gives you a sense of how many highly similar proteins are present in the database. Click on the Score for several of the alignments (include a high and a low one). This lets you see how identities and similarities are distributed along the primary sequence.
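To make the idea of identities distributed along the primary sequence concrete, here is a minimal sketch that computes the fraction of identical residues in consecutive windows of a pairwise alignment. The function name `identity_profile`, the window size, and the two aligned fragments are made up for illustration; they are not real Tetrahymena data and not the method Blink itself uses.

```python
def identity_profile(aligned_a, aligned_b, window=10):
    """Return the fraction of identical positions in each consecutive window.

    Both inputs are rows of one pairwise alignment, so they must have the
    same length; "-" marks a gap and never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    profile = []
    for start in range(0, len(aligned_a), window):
        pairs = list(zip(aligned_a[start:start + window],
                         aligned_b[start:start + window]))
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        profile.append(matches / len(pairs))
    return profile


# Hypothetical aligned fragments, for illustration only.
query = "MKTAYIAKQR-QISFVKSHFS"
subject = "MKTAYLAKQRGQLSFVRSH-S"

for index, fraction in enumerate(identity_profile(query, subject, window=7)):
    print(f"window {index}: {fraction:.2f} identical")
```

Windows with a high fraction correspond to the conserved stretches you would look for in the alignment display; a smaller window gives a finer but noisier profile.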
Click on the Multiple Alignment button at the top.
This will align the various proteins from Blink with each other. In the graphic, use the control at the bottom to examine the alignment. Note insertions and deletions as well as the most conserved regions. What does the colour coding represent?
Save proof of your explorations as printouts
(to PDF, or screen captures) of the pages you have visited for your protein. Append them to your report. Print the Blink page and some evidence of exploration of the other links. (Sometimes it is best to capture these as screenshots, which can be pasted into the data section of your report PDF.) The BLAST scores represent the highest-scoring alignments of your protein against all other proteins at NCBI. Let's look at how these alignments are generated, first with FASTA and then with BLAST.
Do a FASTA search (http://www2.ebi.ac.uk/fasta33/).
Paste in your sequence (FASTA format) and run with default parameters. Save the output hit list to your hard drive. Click on the FASTA Result or Visual FASTA button. This will load the actual output of the sequence comparisons. Save it to your hard drive. Run the search against the entire UniProt Knowledgebase, the human database alone, and the Archaea alone, using the default settings. Download the outputs to your hard drive for use in your report.
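Before pasting a sequence into the search form, it can help to confirm that it really is in FASTA format (a header line starting with ">" followed by sequence lines). The sketch below is one possible check; the function names, the residue alphabet chosen, and the example record are assumptions for illustration, not part of the EBI service.

```python
# One-letter amino acid codes, plus the ambiguity codes B, X, Z, the rarer
# U and O, and "*" for a stop. This choice of alphabet is an assumption.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZUO*")


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def unexpected_residues(sequence):
    """Return the sorted characters that are not recognised residue codes."""
    return sorted(set(sequence) - VALID_RESIDUES)


# Placeholder record: the header and the sequence are invented.
example = """>example_protein hypothetical record for illustration
MKTAYIAKQRQISFVKSHFS
RQLEERLGLIEVQ
"""

for header, sequence in read_fasta(example):
    print(header, len(sequence), unexpected_residues(sequence))
```

An empty list from `unexpected_residues` means every character is a recognised code; digits or stray punctuation copied from a flatfile would show up there.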
The default setting is ktup 2. Run these searches again against
human and Archaea with ktup 1. Does it change the hit list, scores or E-values (look especially at the numerically higher E-values)? The default matrix is BLOSUM 50. Set the matrix to BLOSUM 80 and run against Archaea (leaving other settings at default values). Save your results. Does anything change?
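The ktup value is the length of the exact word matches used to seed a comparison. The following sketch is a deliberately simplified illustration of why a smaller ktup is more permissive: it only lists the words two sequences share and does not reproduce the lookup-table and diagonal scoring that FASTA actually performs. The function names and the two short sequences are hypothetical.

```python
def ktuples(sequence, k):
    """Return the set of all overlapping words of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_ktuples(seq_a, seq_b, k):
    """Return the words of length k that occur in both sequences."""
    return ktuples(seq_a, k) & ktuples(seq_b, k)


# Two invented, diverged fragments; not real data.
seq_a = "MKTAYIAK"
seq_b = "MRTAYLAR"

for k in (1, 2):
    words = sorted(shared_ktuples(seq_a, seq_b, k))
    print(f"ktup {k}: {len(words)} shared words {words}")
```

With ktup 1 any shared residue can seed a comparison, so weakly similar sequences are more likely to be examined, at the cost of a slower search. With ktup 2 two adjacent residues must match exactly, which is faster but can miss distant relatives.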
Prepare a report (1 page max.) that clearly shows the effects of
varying these parameters. Cut and paste the data in, and draw conclusions based on your data, which will be appended. All statements must be supported by data! The results will be unique and highly dependent on which protein is used and whether it is strongly or weakly conserved through evolution.
Run a BLASTP (NCBI, BLAST, Protein-Protein) search
against the nr (non-redundant) database. Use the same protein as for the FASTA search. Open Algorithm Parameters at the bottom of the page. Explore ktup (Word Size) (2 or 3) and BLOSUM 45 and 80, and run against the Archaea and Human databases (default parameters only). Compare your results. Try a comparison of Gap Cost set to minimum and maximum against the human database. What changes? Put together a report on BLAST (1 page max.). Describe your findings, draw conclusions, and compare the results to those found with FASTA.
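To reason about what changing Gap Cost might do, it helps to see how an affine gap penalty grows with gap length. The sketch below assumes the convention that a gap of length k costs existence + k × extension; the function name and the parameter pairs are illustrative assumptions, so check the options actually offered on the BLAST page for your chosen matrix.

```python
def gap_cost(length, existence, extension):
    """Affine gap penalty, assuming cost = existence + length * extension."""
    if length < 1:
        raise ValueError("a gap must span at least one position")
    return existence + length * extension


# Illustrative (existence, extension) pairs; the real choices depend on the
# scoring matrix selected in the BLAST form.
settings = {"lower cost": (9, 1), "higher cost": (13, 1)}

for label, (existence, extension) in settings.items():
    costs = [gap_cost(k, existence, extension) for k in (1, 3, 10)]
    print(f"{label}: gaps of length 1, 3, 10 cost {costs}")
```

Under these assumptions, cheaper gaps let alignments bridge insertions and deletions more readily, so alignments may become longer and more gapped, while expensive gaps favour shorter, ungapped matches. Compare this expectation with what you observe in your own hit lists.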
The initial page of the BLAST result shows a search
of the Conserved Domains Database. Click on the graphic for the conserved domains. What types are present in your protein? What do they represent? The results that you get are unique to your choice of protein, the parameters and the database searched. No two reports should be alike. Look for changes in the scores and alignments, particularly lower in the hit lists. Also note whether proteins are newly included or excluded as parameters change.
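One way to track which proteins enter or leave a hit list between two runs is to compare the saved results programmatically. This is a minimal sketch, assuming you have already extracted each run into a mapping from hit identifier to E-value; the function name, the identifiers, and the E-values below are placeholders, not real search output.

```python
def compare_hits(baseline, variant):
    """Compare two hit lists given as {hit identifier: E-value} mappings."""
    base_ids, variant_ids = set(baseline), set(variant)
    return {
        "gained": sorted(variant_ids - base_ids),
        "lost": sorted(base_ids - variant_ids),
        "changed": {
            hit: (baseline[hit], variant[hit])
            for hit in sorted(base_ids & variant_ids)
            if baseline[hit] != variant[hit]
        },
    }


# Placeholder results for two runs that differ in one parameter.
run_default = {"hit_A": 1e-50, "hit_B": 2e-10, "hit_C": 0.5}
run_modified = {"hit_A": 1e-50, "hit_B": 5e-12, "hit_D": 0.8}

summary = compare_hits(run_default, run_modified)
print("gained:", summary["gained"])
print("lost:", summary["lost"])
for hit, (before, after) in summary["changed"].items():
    print(f"{hit}: E-value {before} -> {after}")
```

Hits near the bottom of the list, with the numerically highest E-values, are the ones most likely to appear under "gained" or "lost", which is why the lower part of each hit list deserves the closest look.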