Bio/CS-251

Laboratory 2
Examining A Single DNA Sequence

Jan 31, 2007

For this first lab using the bioinformatics tools that are found on the web, we will follow
the last part of Chapter 5 of Bioinformatics for Dummies, henceforth abbreviated as BFD.
The first part of the chapter deals with cleaning up a DNA sequence that a
microbiologist may have collected in the lab and with designing PCR primers. We
will discuss the latter topic at a future date as we approach our wet lab exercise. For
now, we will borrow a known sequence from the NCBI web page:
http://www.ncbi.nlm.nih.gov/
The gene that we choose is the mutS/hMSH2 DNA repair gene. In addition to
following the readings and guided steps on pages 138-, we will ask you to answer some
questions related to your findings.
First, we give some background on this gene.
mutS is the name given to a prokaryotic (bacterial) defender of the genome (mut is an
abbreviation that reflects the increased rate at which DNA mutations accumulate in cells
that do not have this critical gene).
This gene is universal in that it is found in virtually every organism, both prokaryotic and
eukaryotic.
MSH2 is the name given to the eukaryotic (algae and fungi, plants and animals) version
of this gene (MSH is an abbreviation that means MutS Homolog). The term
homolog means that MSH2 looks and acts like the mutS gene, i.e., its structure is
similar to mutS and it plays a similar role in preventing mutations from occurring.
hMSH2: the prefix h in front of the gene name indicates that it is the human version of the
gene.
Before we begin the lab, read Analyzing DNA Composition on page 138 of BFD and
answer these questions.
Q1:

We are analyzing a single sequence of DNA that represents the entire sample of
DNA that was obtained in the hypothetical lab. This sequence obviously
represents only one strand of the DNA that was extracted in the lab. The single
strand of DNA is denoted as cDNA (complementary DNA). How is cDNA
created? HINT: read the preface to Chapter 5 or GOOGLE cDNA.

cDNA is created using reverse transcriptase. DNA is first transcribed into mRNA, which is
matured by adding a poly-A tail. Reverse transcriptase can then make a DNA strand that is
complementary to the mRNA. This newly made strand is complementary DNA (cDNA).
Q2:

Why is the pairing between guanosine and cytosine nucleotides more stable than
the pairing between adenosine and thymidine?

The pairing between guanosine and cytosine is more stable because they form three
hydrogen bonds with each other, whereas adenosine and thymidine form only two.
Q3: If we know the G+C count, can we find the frequency of all of the bases in the
sample of DNA that was obtained in the lab? How is this done?
Yes, for double-stranded DNA. G always pairs with C and A always pairs with T, so
once the percentage of one pair is known, the other pair must make up the rest of the total.
For example, if G+C make up 60%, then A+T must make up 40%. Within each pair the two
bases occur in equal amounts, so each base is half of its pair's percentage: G=30%,
C=30%, A=20%, T=20%.
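The arithmetic in the answer to Q3 can be written as a short Python sketch. The function name base_fractions_from_gc is hypothetical, and the calculation assumes double-stranded DNA in which G=C and A=T hold exactly.

```python
def base_fractions_from_gc(gc_fraction):
    """Per-base fractions for double-stranded DNA, given the G+C fraction.

    Assumes exact base pairing (G = C and A = T), as in the Q3 answer.
    """
    if not 0.0 <= gc_fraction <= 1.0:
        raise ValueError("gc_fraction must be between 0 and 1")
    at_fraction = 1.0 - gc_fraction
    return {
        "G": gc_fraction / 2,
        "C": gc_fraction / 2,
        "A": at_fraction / 2,
        "T": at_fraction / 2,
    }


# The worked example from the text: G+C = 60%
fractions = base_fractions_from_gc(0.60)
for base in "GCAT":
    print(base, round(100 * fractions[base], 2), "%")
```

This does not apply to a single strand taken on its own, where G and C need not be equal.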
OK, now on to the lab procedures
Procedure: Collect your sequence from NCBI
Go to the NCBI web site for GenBank given in the URL at the top of this page.
a. From the Search pull down menu, choose Gene
b. In the For window type hMSH2 and click Go
c. Several references to the human versions of this gene are listed. Choose the
second entry, MSH2. Click on this entry
d. You will be taken to a page that contains a variety of information about
research that has been done on this gene. Peruse this page.
Q4: What is the complete name of this entry?
mutS homolog 2, colon cancer, nonpolyposis type 1 (E. coli)

Q5: How many papers have been written about this particular entry? HINT:
You will need to go to PubMed for this information. Follow the Links!
168
Q6: As you scroll down the page you will come to a link to the GenBank page
that contains the DNA sequence itself. How many base pairs long is the
sequence for this entry?
80098
e. Scroll back up to the top of the GenBank page and from the Display pulldown
menu choose FASTA.
f. A new page will appear that contains the name of the entry and the listing of
the nucleotides in sequential order, but in a different format from the one at
the bottom of the previous page. Copy all of this information into a Word
document that you will save in your workspace as MSH2.doc. You are now
ready to begin your analysis.
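For readers who would rather work with the sequence in a script than in a Word document, here is a minimal Python sketch of how FASTA text is laid out: one header line beginning with ">", followed by lines of sequence. The function name parse_fasta and the example record are hypothetical, and the sketch handles a single record only.

```python
def parse_fasta(text):
    """Split single-record FASTA text into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    header = lines[0][1:]                    # drop the leading '>'
    sequence = "".join(lines[1:]).upper()    # join the wrapped sequence lines
    return header, sequence


# Made-up record for illustration only; this is not the MSH2 sequence.
example = """>example_record hypothetical sequence for illustration
ATGGCGGTGC
AGCCGAAGTA
"""
header, sequence = parse_fasta(example)
print(header)
print(len(sequence), sequence)
```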
Procedure: follow pages 152 in BFD Counting Words in DNA Sequences

Purpose: to find the count for each of the nucleotides found in this sequence and also to
find the count for each of four significant triplets found in the sequence.
g. After you obtain your result that will be formatted like Figure 5-4, copy and
paste it on a new page of your MSH2.doc.
Q7:

What is the total G+C count for this sequence? Why are the
percentages of G and C that are shown so different? Is this a
violation of Chargaff's rules?
G+C= 41.60%
This seems to be a slight violation of Chargaff's rule because the G and C
contents are not equal. However, Chargaff's rule holds exactly only for
double-stranded DNA, where every G is paired with a C. This sequence
represents a single strand, so the small difference between the two is not
a real violation.
Q8: Give the total count for each of the nucleotides in the strand of
DNA represented by this sequence.
G=21.33% C=20.27%
A=26.08% T=32.32%
Q9: As you will learn, the triplets ATG, TAA, TAG, and TGA can
have a special significance in DNA sequences. What is the
frequency of each of these triplets in the sequence that you just
processed?
ATG=1.68%
TAA=1.94%
TAG=1.46%
TGA=1.84%
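The counts asked for in Q7 through Q9 can also be checked with a short script. The sketch below is not the web tool used in BFD. The function names are hypothetical, and it assumes triplets are counted in overlapping windows and reported as a percentage of all windows, which may differ from how the tool defines its frequencies.

```python
from collections import Counter


def nucleotide_percentages(seq):
    """Percentage of each base (A, C, G, T) in a single strand."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence is empty")
    counts = Counter(seq)
    return {base: 100.0 * counts[base] / len(seq) for base in "ACGT"}


def triplet_percentages(seq, triplets=("ATG", "TAA", "TAG", "TGA")):
    """Percentage of overlapping 3-letter windows equal to each triplet."""
    seq = seq.upper()
    windows = len(seq) - 2
    if windows <= 0:
        raise ValueError("sequence is shorter than one triplet")
    counts = Counter(seq[i:i + 3] for i in range(windows))
    return {t: 100.0 * counts[t] / windows for t in triplets}


# Toy sequence for illustration only; this is not the MSH2 sequence.
toy = "ATGTAAATGTGA"
bases = nucleotide_percentages(toy)
print("G+C =", bases["G"] + bases["C"])
print(triplet_percentages(toy))
```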
Procedure: Follow the instructions on pages 153-154 of BFD
Purpose: To search our sequence for the occurrence of any highly unusual repeat of a
long word (> 3 nucleotides in length)
The people who did the statistical analysis for the program BLAST (which we will begin
using next week) found that a word of length 11 is very unlikely to be repeated purely
by random assignment of the four letters A, C, G, and T. Therefore, we may conclude that
the repeat of an 11-letter word is a significant finding in our sequence. We will look for a
repeated sequence, but not push it as far as 11. We will go with 5.
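To get a feel for why longer words are rarer by chance, the sketch below estimates how often one particular word, chosen in advance, would be expected to turn up in a random sequence. This is a simplified back-of-the-envelope model, not the statistics that BLAST uses. It assumes the four letters are equally likely and independent, and the function name expected_word_count is hypothetical. The length 80098 is the value recorded for Q6.

```python
def expected_word_count(seq_length, word_length, alphabet_size=4):
    """Expected occurrences of one specific word in a random sequence.

    Simplified model: every letter is equally likely and independent, so each
    of the (seq_length - word_length + 1) windows matches the word with
    probability 1 / alphabet_size ** word_length.
    """
    windows = seq_length - word_length + 1
    if windows <= 0:
        return 0.0
    return windows / alphabet_size ** word_length


for k in (3, 5, 11):
    print(k, expected_word_count(80098, k))
```

Under these assumptions a given 5-letter word is expected roughly 78 times in a sequence of this length, while a given 11-letter word is expected about 0.02 times.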
h. Follow the instructions on pages 153-154 using a word length of 5. You will
have to recopy the sequence for MSH2 that you saved in your Word document.
NOTE: In instruction 3 there is no link at Codon usage, composition. Just
find that section on the web page and go to instruction 4 on page 153.
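The word count that this step produces can be sketched in a few lines of Python. This is not the BFD web tool. The function name repeated_words is hypothetical, and words are assumed to be counted in overlapping windows, which may not match the tool exactly.

```python
from collections import Counter


def repeated_words(seq, word_length=5, min_count=200):
    """Return {word: count} for words that occur at least min_count times."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + word_length] for i in range(len(seq) - word_length + 1)
    )
    return {word: n for word, n in counts.items() if n >= min_count}


# Toy sequence with a low threshold, for illustration only.
toy = "ACGTA" * 3 + "TTTTT"
hits = repeated_words(toy, word_length=5, min_count=3)
print(hits)
```

For the real sequence, the call would keep the default min_count of 200 to match Q10.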
Q10: How many 5-letter words are repeated 200 times or more in the
sequence for MSH2?

Q12: List (Copy and Paste) these sequence(s).

Procedure: Using a Dot-Plot to spot long words in a sequence.
Purpose: To provide a streamlined visual method to perform the task of the previous
procedure.

i. Follow the instructions on pages 155 and 156 of BFD. The web page will not
download with your graph, so scroll up until the entire graph appears on your
screen. Then press ALT and Print Scrn at the same time. This will copy the
window that displays the graph. Paste (Ctrl and V) this on a new page of
your Word document. Save this document in a folder called Lab 1 on your
H drive. You should also save this completed Lab worksheet in that folder.
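The idea behind a dot plot can be illustrated with a small Python sketch that compares a sequence against itself. This is not the program used in BFD. The function names and the toy sequence are hypothetical, and only exact window matches are marked.

```python
def self_dot_plot(seq, window=3):
    """Return sorted (i, j) pairs, i < j, where the windows at i and j match."""
    seq = seq.upper()
    starts_by_word = {}
    for i in range(len(seq) - window + 1):
        starts_by_word.setdefault(seq[i:i + window], []).append(i)
    dots = []
    for starts in starts_by_word.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                dots.append((starts[a], starts[b]))
    return sorted(dots)


def render_dot_plot(length, dots):
    """Draw the plot as text: '\\' is the main diagonal, '*' marks a match."""
    grid = [["."] * length for _ in range(length)]
    for i in range(length):
        grid[i][i] = "\\"
    for i, j in dots:
        grid[i][j] = "*"
        grid[j][i] = "*"
    return "\n".join("".join(row) for row in grid)


toy = "GATTACAGATTACA"   # the 7-letter word GATTACA repeated twice
dots = self_dot_plot(toy, window=4)
print(dots)
print(render_dot_plot(len(toy), dots))
```

A repeated word shows up as a run of marks parallel to the main diagonal. The longer the run, the longer the repeated word.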
Q13:

Does this dot plot show any repeated word of significant length?
Think carefully before you answer this question.
Above it is stated that a repeated word must be at least 11 letters long to be
considered statistically significant. By that standard, the dot plot does not show any
repeated words of significant length.
An example of a repeated sequence with tragic consequences

Procedure: Using OMIM (Online Mendelian Inheritance in Man) to examine
a genetic disease caused by repeat sequences
Purpose: Learn how to navigate OMIM

j. Go to http://www.ncbi.nlm.nih.gov/. Under Search, choose Gene,
and type HD into the search box. Open Link #2, HD. Read the
Summary, and then scroll to the bottom of the page. Under NCBI
Reference Sequences (RefSeq), open the link to the mRNA
sequence (NM_002111), then under Display, choose FASTA.
k. Examine the first six lines of the mRNA, and in the space below, record
a triplet sequence that is repeated in tandem more than 10 times:
Q14: Record your triplet repeat here:
GCA
Q15: How many times is the triplet repeated (how many copies of the
triplet?)
21 times in a row
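Spotting a tandem triplet repeat by eye can be cross-checked with a short Python sketch. The function name longest_tandem_triplet and the toy sequence are hypothetical. The sketch scans every start position and reports the longest run of a triplet repeated back to back.

```python
import re


def longest_tandem_triplet(seq, min_copies=2):
    """Return (triplet, copies, start) for the longest tandem run, or None.

    A lookahead is used so that every start position is tried, not just
    non-overlapping matches.
    """
    seq = seq.upper()
    pattern = re.compile(r"(?=(([ACGT]{3})\2{%d,}))" % (min_copies - 1))
    best = None
    for match in pattern.finditer(seq):
        run, unit = match.group(1), match.group(2)
        copies = len(run) // 3
        if best is None or copies > best[1]:
            best = (unit, copies, match.start())
    return best


# Toy sequence for illustration only; this is not the HD mRNA.
toy = "TTA" + "CAG" * 5 + "TTT"
result = longest_tandem_triplet(toy)
print(result)
```

A tandem repeat can be read in three frames, so the same run may appear as CAG, AGC, or GCA depending on where it is taken to start. The copy count can differ by one between frames.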
l. Return to the NCBI Entrez Gene page for the HD gene. Under
Additional Links, select MIM:143100, and open this link to the OMIM
database for the HD gene. You will find that this is a long and detailed
summary of everything that is known about the HD gene and its
pathology. Answer each of the following questions briefly.
Q16: What disease is caused by alterations in the HD gene?
What organ system is affected by this disease?
(You may wish to view the Clinical Synopsis from the Table of
Contents along the left border of the page)
Huntington Disease, Brain
Q17: From the Table of Contents, select Allelic Variants, read this
section, and answer the following question:
What is the molecular genetic basis for the disease? Explain
how repeat sequence variation is responsible for this disease.
The triplet CAG is repeated many times in tandem inside the huntingtin gene, and the
repeat is translated as a polyglutamine tract in the protein product. When the number of
repeats expands beyond the normal range, the altered protein causes the brain to
slowly degenerate, often inducing psychotic and behavioral symptoms.
