Professional Documents
Culture Documents
Cells
Proteins
Form enzymes that send signals to other cells and regulate gene
activity
Form bodys major components
6
DNA structure
8
http://en.wikipedia.org/wiki/Image:Chromatin_Structures.png
Human chromosomes
RNA
RNA is similar to DNA chemically. It is usually only a single strand.
T(hyamine) is replaced by U(racil)
Several types of RNA exist for different functions in the cell.
http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif
Transcription
Translation
Is this true?
11
Denis Noble: The principles of Systems Biology illustrated using the virtual heart
http://velblod.videolectures.net/2007/pascal/eccs07_dresden/noble_denis/eccs07_noble_psb_01.ppt
Proteins
12
http://upload.wikimedia.org/wikipedia/commons/c/c5/Amino_acids_2.png
Amino acids
13
14
Proteins
15
Genes
(folding)
MSG
(transcription)
5
a t g a g t g g a
t a c t c a c c t
16
http://fold.it
MSG
MSR
a t g a g t g g a
a t g a g t c g a
t a c t c a c c t
t a c t c a g c t
17
18
5
Introns are removed from RNA after transcription
Exons are joined:
This process is called splicing
19
Alternative splicing
Different splice variants may be generated
3
5
20
21
DNA sequencing
How do we obtain DNA sequence information from organisms?
Genome assembly
What is needed to put together DNA sequence information from sequencing?
23
24
DNA sequencing
25
26
27
28
Problems
29
30
31
32
Overlap
Finding potentially overlapping reads
Layout
Finding the order of reads along DNA
Consensus (Multiple alignment)
Deriving the DNA sequence from the layout
Next, the method is described at a very abstract level, skipping a lot of details
Finding overlaps
acggagtcc
agtccgcgctt
r1
5
a t g a g t g g a
t a c t c a c c t
r2
33
20 reads:
#
1
2
3
4
5
6
7
8
9
10
34
Read
CATCGTCA
CGGTGAAG
TATGCGCA
GACGAGTC
CTGACAAA
ATGCGCAT
ATGCGGTC
CTGCGTGA
GCGTGACG
GTCGGTGA
Read*
TCACGATG
CTTCACCG
TGCGCATA
GACTCGTC
TTTGTCAG
ATGCGCAT
GACCGCAT
TCACGCAG
CGTCACGC
TCACCGAC
#
11
12
13
14
15
16
17
18
19
20
Read
GGTCGGTG
ATCGTGAT
GCGCTGCG
GCATCGTG
AGCGCGCT
GAAGTTGT
AGTGAAAC
ACGCGATG
GCGCATCG
AAGTGAAA
Read*
CACCGACC
ATCACGAT
CGCAGCGC
CACGATGC
AGCGCGCT
ACAACTTC
GTTTCACT
CATCGCGT
CGATGCGC
TTTCACTT
Finding overlaps
Overlap(1, 6) = 3
6 ATGCGCAT
12 ATCGTGAT
35
1 CATCGTCA
Overlap(1, 12) = 7
12
consensus sequence
36
Ambiguous bases
7*
GACCGCAT
6=6* ATGCGCAT
14
GCATCGTG
1
CATCGTGA
12
ATCGTGAT
19
GCGCATCG
13* CGCAGCGC
--------------------CGCATCGTGAT
37
2
CGGTGAAG
10
GTCGGTGA
11
GGTCGGTG
7
ATGCGGTC
--------------------ATGCGGTCGGTGAAG
Reads
Non-overlapping
read
Overlapping reads
=> Contig
In a 3x109 bp genome with 500 bp reads and 5x coverage, there are ~107 reads and
~107(107-1)/2 = ~5x1013 pairwise sequence comparisons
38
39
Recap: if repeat length exceeds read length, we might not get the correct
assembly
This is a problem especially in eukaryotes
~3.1% of genome consists of repeats in Drosophila, ~45% in human
Possible solutions
1.
2.
40
Genome
BAC library
41
Divide-and-conquer
BAC-by-BAC sequencing
42
Hybrid method
43
44
http://en.wikipedia.org/wiki/Drosophila_melanogaster
Two efforts:
Human Genome Project (public
consortium)
Celera (private company)
45
46
47
49
51
SOLiD
Read length 25-30 bp
1-2 Gb / 5-10 day run
>99.94% accuracy / base
>99.999% accuracy / consensus with 15x coverage