
NATIONAL UNIVERSITY OF SINGAPORE

LSM2241 Introductory Bioinformatics (Semester II: 2010-2011)

Time Allowed: 2 Hours

INSTRUCTIONS TO CANDIDATES

1. This examination paper consists of 2 sections (Sections A & B) printed on 9 pages, including this cover page.
2. SECTION A: Answer all 20 MCQs. (20 marks)
3. SECTION B: Answer TWO out of FOUR questions. (20 marks)
4. This is an OPEN BOOK exam.
5. This examination paper contributes 40% towards the overall mark of the module.

SECTION A (20 marks)
Multiple Choice Questions
Answer all Questions

1. The current estimate for the number of human genes is 30,000. Assuming that a gene contains on average 1,000 nucleotides, genes occupy 30,000,000 bases out of the estimated 3.2 billion total nucleotides of the human genome. The remaining nucleotides are A. alien DNA that currently remains dormant. B. junk DNA. C. the real genetic information, as genes are only backup storage, which is only used in case of emergency. D. a combination of control regions for genes, RNA coding regions, and regions whose purpose is still not known. E. modified nucleotides (i.e. different from the standard A, T, G, or C) that cannot be read by the polymerases.

2. Which of the following statements is CORRECT? A. BLAST aligns protein sequences using the Smith and Waterman dynamic programming approach. B. BLAST aligns protein sequences using the Needleman and Wunsch dynamic programming approach. C. The Smith and Waterman algorithm assumes that all scores in the substitution matrix are negative. D. The Smith and Waterman algorithm assumes that all scores in the substitution matrix are positive. E. The Needleman and Wunsch algorithm for sequence alignment was designed to generate global alignments.

3. The Ramachandran plot A. compares the conformation of the side-chains of a protein. B. shows the accessibility of all amino acids in a protein. C. shows the relationship between the torsion angles phi and psi, for each amino acid in the protein. D. shows the torsion angle around the peptide bond, for each amino acid in the protein. E. shows the number of hydrogen bonds that stabilize a protein.

4. A BLAST search is most useful when you want to do the following: A. Find inverted repeats within a protein sequence. B. Generate the best possible alignment between the target and template sequences to be used as input for homology modeling. C. Find a rat paralog to a human gene. D. Find a rice ortholog to a lettuce gene. E. Predict the secondary structures of a protein.

5. The three alignments given below have been computed for two DNA sequences, GAATTCAGTTA and GATCGA. The score for a perfect match is set to 10 and the score for a mismatch is set to 0. Gap penalties have been ignored.

Alignment A:
GAATTCAGTTA
G-A-TC-G--A

Alignment B:
GAATTCAGTTA
G-AT-C-G--A

Alignment C:
GAATTCAGTTA
GA-T-C-G--A

Which of the following statements is CORRECT? A. Alignment A is optimal, with a higher score than alignments B and C. B. A, B, and C are the only 3 optimal alignments for these two sequences. C. Alignment C is optimal, with a higher score than alignments A and B. D. None of these alignments are optimal. E. All three alignments have the same optimal score.
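The scoring scheme described in Question 5 (10 per matching column, 0 per mismatch, gaps ignored) can be sketched in Python as follows. This is only an illustration: the function name `alignment_score`, its parameter names, and the toy sequences are assumptions and are not part of the examination paper.

```python
def alignment_score(top, bottom, match=10, mismatch=0, gap=0):
    """Score a gapped pairwise alignment column by column.

    `top` and `bottom` are the two aligned rows, with '-' marking a gap.
    The defaults follow the scheme stated in the question: match = 10,
    mismatch = 0, and gaps contribute nothing.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # gap column; ignored under this scheme
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # different residues
    return score


# Toy example (hypothetical sequences): three matching columns and one gap.
print(alignment_score("ACGT", "A-GT"))  # prints 30
```

The same function can be applied to each pair of aligned rows to compare candidate alignments under one scoring scheme.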

6. Given 2 protein sequences that came from very divergent organisms, which substitution matrix is the most suitable for aligning them? A. PAM1 B. BLOSUM100 C. BLOSUM90 D. BLOSUM80 E. BLOSUM32

7. The figure below shows a small peptide of four amino acids. Three of the amino acids are charged at neutral pH. Hydrogens are not shown. What is the sequence of the peptide?

A. DEHF B. DEHY C. DEHW D. DEFY E. DKHY

8. Which of the following techniques does not provide information on the secondary and tertiary structure of a protein? A. X-ray crystallography B. Cryo-EM C. Circular dichroism D. Maxam-Gilbert sequencing E. NMR spectroscopy

9. Long branch attraction is a phylogenetic analysis artifact that arises when A. plant genomes are compared with algae genomes. B. xenologs or paralogs are used in phylogenetic reconstruction. C. the most divergent sequences in the multiple sequence alignment are grouped together at the end of the tree building process. D. the least divergent sequences in the multiple sequence alignment are grouped together at the beginning of the tree building process. E. protein sequences are used in phylogenetic tree construction rather than DNA sequences.

10. The five biological Big Bangs described by Eugene Koonin help us to understand why A. some human genes have BLAST hits to both Bacterial and Archaeal genes. B. RNA binding proteins show common ancestry between Bacteria and Archaea, but DNA binding proteins do not. C. the ATP synthetases of Bacteria and Archaea exhibit convergent evolution. D. the Class I and Class II Lysyl tRNA synthetases of Bacteria and Archaea exhibit convergent evolution. E. All of the above.

11. Given the following list of molecular evolutionary information:
i. 16S RNA sequence phylogenetic trees of Bacteria and Archaea
ii. Nucleic acid phylogenetic trees of viral subtypes
iii. Multidomain protein - full length phylogenetic trees from metazoans
iv. Protein domain family trees from 3D structural studies and secondary structure superposition

Which of the following shows the correct order from the oldest evolutionary information to the most recent? A. ii, i, iii, iv B. ii, iii, i, iv C. iii, iv, i, ii D. iv, i, iii, ii E. iv, ii, i, iii

12. We compare the differences between a sequence alignment generated by an algorithm that works with letter codes and a sequence alignment generated from a pair of 3D structures that are superimposed in space. Which of the following is an incorrect observation? A. There are no gaps in an alignment made from 3D structure superposition. B. Identical residues may be shifted by one or more characters in a 3D structure based alignment. C. There are no BLOSUM or similar scoring matrices used in a 3D structure based alignment. D. Amino acids aligned correspond to the nearest pair of alpha carbons in 3-dimensional space. E. Structural alignments maintain the secondary structure.

13. Of the eight alternative hypotheses for the origin of Eukaryotes, which one of the following is the one not falsified by the supertree data in the paper "Supertrees Disentangle the Chimerical Origin of Eukaryotic Genomes"? A. A chimera involving an Euryarchaeota and an alpha-Proteobacteria. B. A chimera involving an alpha-Proteobacteria and a Crenarchaeota. C. Eukaryotes evolved earlier than Archaea and Bacteria and their genes have few prokaryotic homologs. D. A chimera involving an Archaebacteria and an Actinobacteria. E. A chimera involving an alpha-Proteobacteria and a Thermoplasma.

14. In mass spectrometry, which of the following database information listed below is used to compare instrument data, in order to find hits in genomic sequence information? A. Nucleotide sequence symbols, e.g. ATGC B. Protein sequence symbols, e.g. LTIVRW C. Theoretical database of proteins digested with trypsin into peptide fragments scored with BLOSUM matrices D. Theoretical mass spectral peaks derived from theoretical tryptic digests of sequence database information and predicted fragmentation patterns. E. Chemical formula and masses of amino acids in database proteins.

15. Which three amino acids are most readily oxidized and thus have altered chemistry in mass spectrometry preparations, hence requiring the search algorithm to know the mass of the oxidized forms? A. Ile, Leu, Val B. Trp, Tyr, Phe C. Arg, Lys, His D. Trp, Met, Cys E. Gly, Ala, Ser

16. Post-translational modifications can be detected with mass spectrometry, provided the search algorithm knows about the alteration to the amino acid and peptide mass peaks. Which post-translational modification would be the most difficult to spot with LC/MS-MS tandem mass spectrometry? A. Trypsin cleavage B. Phosphorylation C. Myristylation D. Glycosylation E. Signal peptide cleavage

17. Why is full Dynamic Programming (DP) not used to perform Multiple Sequence Alignment (MSA) most of the time? A. DP can handle only pairwise sequence alignment problems. B. DP is too slow for most MSA problems. C. Heuristic methods can produce better quality MSAs. D. Heuristic methods can produce MSAs that are as good as DP, and they are much faster. E. None of the above.

18. Which of the following step(s) is/are NOT part of the progressive method of MSA? A. Building pairwise alignments for all sequences. B. Building similarity matrix of all sequences. C. Building phylogenetic tree of all sequences. D. Progressively building MSA according to the guide tree. E. A and C.

19. Given a transcription factor family X, which of the following databases should you use to find its conserved peptide motifs? A. JASPAR B. TRANSFAC C. TFmotif D. Prosite E. ClustalW

20. Which of the following statement(s) on Position Specific Scoring Matrix (PSSM) is/are CORRECT? A. Compared to PSSM32, PSSM65 is generated from more conserved sequences. B. Compared to PSSM65, PSSM32 is generated from more conserved sequences. C. PSSMs are widely used in Pfam. D. PSSMs are widely used in Prosite. E. B and D.

SECTION B (20 marks)
Structured Questions
Answer any TWO Questions.

1. A strange plant was found growing outside of a laboratory where scientists are engineering a new bacterial strain optimized for lipid production for biofuels. The plant grows rapidly and spreads along the ground. Its stem seems to be too weak to support its broad leaves, and the stems are fragile. If you step on the stem, it dies quickly as if it was cut off, and the leaves dry up. A discarded cigarette butt caused one of these plants to ignite into flames, suggesting the dead plant is a fire hazard. A microscopic section of the stem reveals an unusual bacterial endosymbiont growing inside the plant. The hypothesis is that this strange new flammable plant is newly evolved from a local plant that acquired a biofuel bacterial endosymbiont that escaped from the laboratory. You are assigned the task of determining whether the plant has acquired this strange endosymbiont from one of the biofuel bacterial strains. You undertake genomic and proteomic analysis of the plant stem extract. Here are the samples you have analyzed:

Shotgun Sequencing of DNA extract of:
1 - the plant stem with the endosymbiont
2 - the plant leaf lacking the endosymbiont
3 - two candidate local plants with similar leaves.

Mass Spectrometry Proteomic Analysis of:
4 - proteins extracted from the plant stem with the endosymbiont
5 - proteins extracted from the biofuel bacterial strain

Within GenBank nucleotide and protein sequence databases, you know that:
6 - there is no genomic sequence for any closely similar plants in GenBank.
7 - the host bacterial strain sequence used by the laboratory is in GenBank
8 - the seven enzymes that have been artificially introduced into the biofuel bacterium are in GenBank and you know their accession numbers

Given the information above, answer all of the following questions:

a. Positive identification of the biofuel strain in sample 1 requires evidence from the DNA sequence information. Which samples would you compare and with what tool? What would be proof of positive identification? (2 marks)
b. Similarly, positive identification of the engineered biofuel strain in sample 4 requires mass spectrometry matches to which of the specific information listed above? (1 mark)
c. Positive identification of the original plant species in sample 2 requires comparisons with which samples? (1 mark)

d. Can you estimate the number of nucleotide base changes across 100 base pairs of DNA you would expect for a recently evolved species variant to differ from the starting species where time is approximately 1 year? In other words, how closely do you expect most genes in the original plant species to match genes in the new plant species at the nucleotide level? (1 mark)
e. Reconstruction of the endosymbiont adapted bacteria may reveal differences from the laboratory strain. You need to reconstruct the endosymbiont genome. Assume that you assemble the shotgun pieces into contigs first, as much as possible. List three ways by which you can tell whether a DNA shotgun sequence in sample 1 comes from bacteria or plant. (3 marks)
f. Evolutionary changes between the endosymbiont adapted bacteria and the original laboratory strain may be related to the bacteria acquiring new genes as xenologs. For example, there may be an additional enzyme acquired from another bacterium that confers an ability for the endosymbiont to utilize complex carbohydrates within the plant stem for the creation of biofuel lipids. Explain how you would find candidate xenologs given the reconstructed endosymbiont genome. (2 marks)

2. In the area of sequence analysis:
a. What are the distinguishing features of the three different "central dogmas" that represent the differences in genome and proteome information between viruses, prokaryotes and eukaryotes? (4 marks)
b. Explain how BLAST nucleotide searching strategies will differ between the search for similar protein encoding genes in genomic sequences in bacteria and the search for similar protein encoding genes in genomic sequences in metazoan eukaryotes. (2 marks)
c. What do you think is the evolutionary advantage for viruses in retaining massive single genes that are translated as a single polypeptide and cleaved into functional proteins by a protease? How might this reduce the number of genes the virus needs to carry around with it? (2 marks)
d. Explain how you would make a BLAST searchable database of independently processed (i.e. protease cleaved) viral proteins from viral genomic sequences. (2 marks)

3. We want to find the best alignment(s) between the 2 DNA sequences, AATGTC and AGCTC. The scoring scheme S is defined as follows: S(i,i) = 10, S(i,j) = 5 if i and j are both purines, or both pyrimidines, and S(i,j) = 0 if i is a purine and j is a pyrimidine, or if i is a pyrimidine and j is a purine. There are no gap penalties. Find the score Sbest, the number N of optimal alignments as well as all optimal alignments for these two sequences (show your final dynamic programming matrix for full credit). (10 marks)
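The dynamic programming recurrence implied by the scoring scheme in Question 3 can be sketched in Python as below. This is a minimal illustration under stated assumptions: the names `substitution_score` and `dp_matrix` are hypothetical, the alignment is treated as global with zero gap penalties (so the first row and column are all 0), and the toy sequences are not those of the question. Traceback and the counting of optimal alignments are deliberately left out.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y):
    """S(i,i) = 10; 5 if both purines or both pyrimidines; otherwise 0."""
    if x == y:
        return 10
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return 5 if same_class else 0


def dp_matrix(seq1, seq2):
    """Fill the DP matrix for a global alignment with no gap penalties.

    F[i][j] is the best score for aligning seq1[:i] with seq2[:j].
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    F = [[0] * cols for _ in range(rows)]  # zero gap cost: borders stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = F[i - 1][j - 1] + substitution_score(seq1[i - 1], seq2[j - 1])
            F[i][j] = max(diagonal, F[i - 1][j], F[i][j - 1])
    return F


# Toy example (hypothetical sequences): A/G scores 5 and C/C scores 10.
matrix = dp_matrix("AC", "GC")
for row in matrix:
    print(row)
print(matrix[-1][-1])  # prints 15
```

The bottom-right cell holds the best score; tracing back through every cell whose value equals the maximum of its three predecessors recovers the optimal alignments.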

4. The figures below show 3 proteins, named A, B, and C, and 3 Ramachandran plots, named R1, R2, and R3. Find the correspondence between proteins and Ramachandran plots (note that each of the three Ramachandran plots was actually computed from one of the three proteins). Explain how you obtain this correspondence. (10 marks)

~ End of Paper ~
