Subtitles Why Genomics

WEBVTT
1
00:00:00.000 --> 00:00:03.152
This lecture is about why we study
genomics and what it can teach us.
2
00:00:03.152 --> 00:00:08.150
So genomics is the study of
the genomes inside of us.
3
00:00:08.150 --> 00:00:10.600
Let's talk about human genomics.
4
00:00:10.600 --> 00:00:14.490
Everybody on the planet has a genome
that has governed their development and
5
00:00:14.490 --> 00:00:16.770
governs a lot of their biology, and
6
00:00:16.770 --> 00:00:21.030
as you can see by looking at any crowd
of people, we all look really different.
7
00:00:21.030 --> 00:00:24.620
However, we've discovered through
sequencing in recent years
8
00:00:24.620 --> 00:00:28.590
that we're actually 99.9% identical or
even more than that.
9
00:00:28.590 --> 00:00:33.005
So it's really remarkable how
much diversity you can create
10
00:00:33.005 --> 00:00:38.020
from a very small number
of changes in your genome.
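As a rough back-of-the-envelope check of what 99.9% identity means, the sketch below assumes a genome of about 3 billion base pairs (the upper figure quoted later in this lecture for human-sized genomes). The variable names are illustrative only.

```python
# Rough arithmetic: how many positions differ between two people
# if their genomes are 99.9% identical?
GENOME_SIZE_BP = 3_000_000_000   # assumption: ~3 billion base pairs
IDENTITY = 0.999                 # "99.9% identical", as stated in the lecture

differing_positions = round(GENOME_SIZE_BP * (1 - IDENTITY))
print(f"about {differing_positions:,} differing positions")
```

Under these assumptions, "a very small number of changes" is small only in relative terms: it still comes to a few million positions.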
11
00:00:38.020 --> 00:00:42.484
But of course, now that we know that we're
99.9% identical, we still want to know,
12
00:00:42.484 --> 00:00:45.055
what is it that's driving
all these differences?
13
00:00:45.055 --> 00:00:47.110
Why is one person tall and
another person short?
14
00:00:47.110 --> 00:00:52.230
Why does one person live to be 100
and another person lives to be not 100?
15
00:00:52.230 --> 00:00:55.109
Why does one person get cancer and
another person not?
16
00:00:55.109 --> 00:00:59.084
Many of these things we suspect
are driven by our genomes, and
17
00:00:59.084 --> 00:01:00.925
we want to understand that.
18
00:01:00.925 --> 00:01:05.853
So, another thing, one of the most
basic things that our genome determines
19
00:01:05.853 --> 00:01:08.060
is how our bodies develop.
20
00:01:08.060 --> 00:01:11.709
We start off, as you all know, we start
off as a single cell which divides into
21
00:01:11.709 --> 00:01:15.362
a few apparently identical cells, but
that quickly divides into an embryo,
22
00:01:15.362 --> 00:01:17.408
and eventually grows into a whole person.
23
00:01:17.408 --> 00:01:20.904
And somehow that entire program of
development is encoded in our genome, and
24
00:01:20.904 --> 00:01:23.178
this is something that we don't yet
understand.
25
00:01:23.178 --> 00:01:24.015
In addition,
26
00:01:24.015 --> 00:01:28.620
the code in our cells determines all
the different cell types and for example,
27
00:01:28.620 --> 00:01:33.364
it determines how to make a neuron, which
is a very complicated cell, obviously
28
00:01:33.364 --> 00:01:38.210
a very different kind of cell from say a
skin cell, it does very different things.
29
00:01:38.210 --> 00:01:41.823
And yet the genome inside of a neuron
in your body is identical to the genome
30
00:01:41.823 --> 00:01:43.468
inside of any of your skin cells.
31
00:01:43.468 --> 00:01:47.296
So we want to understand what's going on
in that cell even though it has the same
32
00:01:47.296 --> 00:01:51.240
program, the same code, somehow it's
executing a different program to make it
33
00:01:51.240 --> 00:01:52.877
into a neuron versus a skin cell.
34
00:01:52.877 --> 00:01:56.750
Another big area of research
in genomics is cancer.
35
00:01:56.750 --> 00:02:01.510
So cancer is essentially
a genetic disease, we know now.
36
00:02:01.510 --> 00:02:05.220
Cancer cells are simply, again, cells in
your body that have the same genetic code,
37
00:02:05.220 --> 00:02:08.140
the same genome in them, but
somehow they've gone haywire,
38
00:02:08.140 --> 00:02:10.840
and they've started
replicating without control.
39
00:02:10.840 --> 00:02:12.560
That's what makes something cancerous.
40
00:02:12.560 --> 00:02:15.970
Basically it's cells that are dividing
without any check on their division.
41
00:02:15.970 --> 00:02:19.640
And, in fact, we define cancers by
the type of cell that started the cancer.
42
00:02:19.640 --> 00:02:23.700
So there's skin cancer, where a skin
cell starts dividing without control.
43
00:02:23.700 --> 00:02:25.370
It's also called melanoma.
44
00:02:25.370 --> 00:02:26.500
There's lung cancer.
45
00:02:26.500 --> 00:02:28.500
There's blood cancers
that are called leukemia.
46
00:02:28.500 --> 00:02:32.459
These are all defined by the cells
that started the cancer out and
47
00:02:32.459 --> 00:02:35.169
they all have a common phenotype, that is,
48
00:02:35.169 --> 00:02:39.658
they all have a common feature that
they're dividing without control.
49
00:02:39.658 --> 00:02:44.604
But the consequences of different cancers
are very different, and in fact,
50
00:02:44.604 --> 00:02:49.475
the mutations in our DNA that cause
these cells to become cancerous are also
51
00:02:49.475 --> 00:02:50.340
different.
52
00:02:52.530 --> 00:02:54.982
So what do our genes have
to do with any of this?
53
00:02:54.982 --> 00:02:56.040
So what I'm talking about,
54
00:02:56.040 --> 00:03:00.670
I just mentioned the word mutation,
a mutation is a change in your genome.
55
00:03:00.670 --> 00:03:04.780
And that can happen because
your DNA is damaged,
56
00:03:04.780 --> 00:03:07.060
it can happen because of
an accident in replication.
57
00:03:07.060 --> 00:03:10.055
So every time your cells divide,
to explain that latter point,
58
00:03:10.055 --> 00:03:13.670
every time your cells divide,
the entire genome has to be copied.
59
00:03:13.670 --> 00:03:16.190
And our cells are really,
really good at this, fortunately,
60
00:03:16.190 --> 00:03:17.720
otherwise we wouldn't exist.
61
00:03:17.720 --> 00:03:21.740
We wouldn't survive for very long, but
once in a while, they make an error,
62
00:03:21.740 --> 00:03:25.985
probably only one to three
errors per cell division.
63
00:03:25.985 --> 00:03:29.520
And once in a while, that error
causes something bad to happen, and
64
00:03:29.520 --> 00:03:33.300
we believe a lot of cancers are caused
by these sort of accidental errors.
65
00:03:33.300 --> 00:03:37.990
And understanding that is a matter
of understanding, well okay,
66
00:03:37.990 --> 00:03:42.390
my cell makes an error,
what does it mean for a mutation or
67
00:03:42.390 --> 00:03:46.160
an error in replication
to turn a cell cancerous?
68
00:03:46.160 --> 00:03:50.330
What usually we think happens is
that that mutation affects a gene
69
00:03:50.330 --> 00:03:53.500
which now doesn't function properly and
that gene, for example,
70
00:03:53.500 --> 00:03:56.950
that might be a gene that
controls cell division, and
71
00:03:56.950 --> 00:03:59.610
now you've sort of turned off
the check on cell division.
72
00:03:59.610 --> 00:04:02.050
And now the cell starts replicating
without control and you have a cancer.
73
00:04:02.050 --> 00:04:04.720
So that's the kind of thing we're
looking at when we're using genomics to
74
00:04:04.720 --> 00:04:05.570
study cancer.
75
00:04:06.690 --> 00:04:07.950
So how does this all work?
76
00:04:07.950 --> 00:04:10.790
So this program that I'm talking
about that's encoded in our DNA.
77
00:04:10.790 --> 00:04:14.420
Well there's something
called the central dogma.
78
00:04:14.420 --> 00:04:18.660
I didn't make that word up, that phrase
was created by Francis Crick,
79
00:04:18.660 --> 00:04:21.990
one of the co-discoverers of the structure
of DNA over fifty years ago.
80
00:04:23.070 --> 00:04:26.500
And it's now still used,
even though, as with many dogmas,
81
00:04:26.500 --> 00:04:27.770
it's not an absolute dogma.
82
00:04:27.770 --> 00:04:31.760
But the central dogma of biology,
or molecular biology,
83
00:04:31.760 --> 00:04:36.485
says that information flows in
a single direction from your genome,
84
00:04:36.485 --> 00:04:39.262
that is your DNA, to RNA, to proteins.
85
00:04:39.262 --> 00:04:43.789
And the processes that govern
that we give different names.
86
00:04:43.789 --> 00:04:48.294
So the copying, when DNA is turned into
genes, the first step is you take pieces
87
00:04:48.294 --> 00:04:52.665
of it called exons, and you transcribe
them, that's the copying process,
88
00:04:52.665 --> 00:04:57.036
into RNA, and RNA is essentially an exact
copy of the DNA where all the letters
89
00:04:57.036 --> 00:05:00.399
are the same with the only
difference being the letter t, or
90
00:05:00.399 --> 00:05:03.350
thymine becomes a letter u,
which is uracil.
91
00:05:03.350 --> 00:05:05.980
But otherwise it's
molecularly the same thing.
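The letter-for-letter copying described here can be sketched in a few lines. This is a minimal illustration of the lecture's simplified picture (same letters, with T replaced by U), not a model of the real molecular machinery; the function name and the example sequence are made up.

```python
def transcribe(dna: str) -> str:
    """Return the RNA spelling of a DNA string: same letters, with T replaced by U."""
    return dna.upper().replace("T", "U")


rna = transcribe("ATGGCCTAA")  # made-up example sequence
print(rna)  # AUGGCCUAA
```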
92
00:05:05.980 --> 00:05:08.110
That RNA then has to be
turned into a protein.
93
00:05:08.110 --> 00:05:11.970
Now, proteins are not comprised of
these four letters of nucleic acids.
94
00:05:11.970 --> 00:05:16.380
They're comprised of 20 letters that
are called the abbreviations for
95
00:05:16.380 --> 00:05:21.050
amino acids and proteins are also long
molecules, not nearly as long as DNA.
96
00:05:21.050 --> 00:05:23.960
A typical protein might be 300 or
400 amino acids long,
97
00:05:23.960 --> 00:05:26.940
and the way you get a protein
is you take a piece of RNA and
98
00:05:26.940 --> 00:05:31.980
you read it three letters at a time,
and each triplet encodes an amino acid.
99
00:05:31.980 --> 00:05:36.880
And if you think about it for a second
there's four possible RNA nucleotides.
100
00:05:36.880 --> 00:05:41.070
So there's four to the third,
or 64 possible combinations.
101
00:05:41.070 --> 00:05:45.480
Each of those 64 triplets gets
102
00:05:45.480 --> 00:05:50.200
translated either into an amino acid or not.
103
00:05:50.200 --> 00:05:52.220
There's three special
ones called stop codons.
104
00:05:52.220 --> 00:05:53.500
They indicate the end of a protein.
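The triplet reading just described can be sketched as follows. The table is the standard genetic code written in a compact form, with one-letter amino acid abbreviations and "*" marking a stop codon. The function name and the example RNA string are illustrative assumptions, and the sketch simply starts reading at the first letter rather than searching for a start signal.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, listed in UCAG order.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
# All 4**3 = 64 triplets, each mapped to an amino acid or to a stop ("*").
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str) -> str:
    """Read RNA three letters at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":  # stop codon: end of the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))        # 64
print(translate("AUGGCCUAA"))  # MA
```

Counting the entries confirms the numbers in the lecture: 64 triplets, of which three are stop codons, and the remaining 61 spell 20 different amino acids.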
105
00:05:53.500 --> 00:05:58.140
So that's basically how DNA goes and
becomes a protein.
106
00:05:58.140 --> 00:06:00.540
And the proteins kind of do
all the work of your cells.
107
00:06:00.540 --> 00:06:04.650
So the proteins in your body
are what are actually doing most
108
00:06:04.650 --> 00:06:06.900
of the functional work of say,
metabolizing things,
109
00:06:06.900 --> 00:06:10.160
digesting your food,
moving things around in the cells.
110
00:06:10.160 --> 00:06:15.270
So that fundamental dogma has been around
for many decades now, and it more or less
111
00:06:15.270 --> 00:06:20.960
describes how information flows most of
the time from your genome to proteins.
112
00:06:20.960 --> 00:06:22.940
However, that's not the whole picture,
we now know.
113
00:06:22.940 --> 00:06:27.690
So over time, we've learned that
information can flow the other way,
114
00:06:27.690 --> 00:06:31.280
and as scientists got more
familiar with the whole model,
115
00:06:31.280 --> 00:06:33.490
they realized that it had
to flow the other way.
116
00:06:33.490 --> 00:06:36.530
As I was saying a little earlier in this
lecture, there are many different cell
117
00:06:36.530 --> 00:06:39.890
types in your body,
every cell has the same exact DNA.
118
00:06:39.890 --> 00:06:42.870
So if everything just flowed
from the DNA to the proteins,
119
00:06:42.870 --> 00:06:46.500
it would seem sort of fundamentally
impossible for the cells to
120
00:06:46.500 --> 00:06:50.120
behave differently, yet we know that
neurons don't act like skin cells.
121
00:06:50.120 --> 00:06:50.825
So what's going on?
122
00:06:50.825 --> 00:06:54.800
So the proteins themselves, some of
the proteins that are created by the DNA
123
00:06:54.800 --> 00:06:57.880
go back and bind to that DNA stuff and
modify it and
124
00:06:57.880 --> 00:07:00.250
change the genes that get turned on and
off.
125
00:07:00.250 --> 00:07:02.158
So proteins can self regulate in this way.
126
00:07:02.158 --> 00:07:05.820
And there are other things that can
happen with DNA, other modifiers,
127
00:07:05.820 --> 00:07:09.470
some are called methylation marks
that can change DNA as well.
128
00:07:09.470 --> 00:07:13.214
So there are features on the DNA that
are affected by the proteins themselves.
129
00:07:13.214 --> 00:07:17.692
So there are feedback loops in the process
in this sort of information flow, and
130
00:07:17.692 --> 00:07:21.620
as a result,
information's actually flowing backwards.
131
00:07:21.620 --> 00:07:22.861
So in the genomics field, so
132
00:07:22.861 --> 00:07:25.463
how do we make these measurements
that I'm talking about?
133
00:07:25.463 --> 00:07:29.277
How do we measure if you want to
understand cancer, then we have to go and
134
00:07:29.277 --> 00:07:33.233
get some cancer cells and figure out
what mutations happen in the cells.
135
00:07:33.233 --> 00:07:34.138
So how do we do that?
136
00:07:34.138 --> 00:07:35.800
We do that with sequencing.
137
00:07:35.800 --> 00:07:39.040
So sequencing is sort of at
the heart of genomics, and
138
00:07:39.040 --> 00:07:42.620
the genomics revolution that we've been
in for about the past 20 years, and
139
00:07:42.620 --> 00:07:46.150
this really accelerated
over the past ten years.
140
00:07:46.150 --> 00:07:50.220
And one reason for this acceleration
is that genome technology has gotten
141
00:07:50.220 --> 00:07:52.500
incredibly fast and efficient.
142
00:07:52.500 --> 00:07:55.420
So what you're looking at here are some
of the latest sequencing machines.
143
00:07:55.420 --> 00:07:58.800
A sequencer today, the highest-throughput
sequencer we have today can sequence in
144
00:07:58.800 --> 00:08:04.120
a single run of the machine,
as many as a trillion nucleotides of DNA.
145
00:08:04.120 --> 00:08:08.240
So to give you a sense of what that means,
the Human Genome Project was started
146
00:08:08.240 --> 00:08:13.170
in 1989 with the goal of sequencing
one human genome in 15 years.
147
00:08:13.170 --> 00:08:16.430
It beat that goal, we actually
published the human genome in 2001, so
148
00:08:16.430 --> 00:08:19.290
in just 12 years we finished the project.
149
00:08:19.290 --> 00:08:21.810
I was part of that project.
150
00:08:21.810 --> 00:08:24.780
And it was a massive effort
involving thousands of scientists
151
00:08:24.780 --> 00:08:25.990
from around the world.
152
00:08:25.990 --> 00:08:27.480
And sequencers were employed at
153
00:08:28.620 --> 00:08:31.730
half a dozen huge genome
sequencing centers in the US, and
154
00:08:31.730 --> 00:08:36.590
large sequencing centers in the UK,
in France, in China, all over the world.
155
00:08:36.590 --> 00:08:40.210
Today you can get
a sequencer in a single lab,
156
00:08:40.210 --> 00:08:44.890
one of these machines run by a single
investigator, and in just a few days,
157
00:08:44.890 --> 00:08:49.150
you can sequence on the order of several
hundred human genome equivalents.
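The "several hundred genome equivalents" figure follows from simple division. The sketch below assumes a human genome of about 3 billion base pairs (the upper figure given later in this lecture) and takes the trillion-nucleotide run size at face value; the variable names are illustrative.

```python
BASES_PER_RUN = 10**12      # "as many as a trillion nucleotides" in one run
GENOME_SIZE_BP = 3 * 10**9  # assumption: ~3 billion base pairs per human genome

genome_equivalents = BASES_PER_RUN / GENOME_SIZE_BP
print(f"about {genome_equivalents:.0f} genome equivalents per run")
```

A genome equivalent here is just raw bases divided by genome size. In practice each position has to be read many times over to get a reliable sequence, so the number of finished genomes per run is smaller than this figure.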
158
00:08:49.150 --> 00:08:53.360
So now we're in maybe a little more than
a dozen years after the completion of
159
00:08:53.360 --> 00:08:54.135
the human genome.
160
00:08:54.135 --> 00:08:57.140
12 year project involving
thousands of scientists.
161
00:08:57.140 --> 00:09:00.330
Now a single scientist in one day can
do far more sequencing than that entire
162
00:09:00.330 --> 00:09:01.630
consortium did.
163
00:09:01.630 --> 00:09:04.947
So that's allowed us to start looking
at things like cancer genomics.
164
00:09:04.947 --> 00:09:09.369
When the human genome was published in
2001, no one at that time thought it was
165
00:09:09.369 --> 00:09:13.922
even remotely feasible to start sequencing
the entire genome of a single tumor, and
166
00:09:13.922 --> 00:09:17.449
yet today, we have literally tens
of thousands of projects going
167
00:09:17.449 --> 00:09:19.659
on around the world doing exactly that.
168
00:09:19.659 --> 00:09:24.148
So the result of that is that we
are generating these enormous,
169
00:09:24.148 --> 00:09:26.400
enormous data sets.
170
00:09:26.400 --> 00:09:30.470
So sure we can sequence all that data,
but what I didn't say was that
171
00:09:30.470 --> 00:09:33.200
towards the end of the Human Genome
Project, when we were at the point where
172
00:09:33.200 --> 00:09:36.280
we were writing the paper, and I was part
of one of the teams that was doing that,
173
00:09:36.280 --> 00:09:39.850
we had hundreds of scientists frantically
trying to analyze all this data from
174
00:09:39.850 --> 00:09:43.620
a single genome and figure out what we
could say about it in a scientific paper.
175
00:09:43.620 --> 00:09:48.250
So today, one investigator, one lab,
can generate multiple genomes
176
00:09:48.250 --> 00:09:52.150
in a space of a week, but that doesn't
mean that in the space of a week, or
177
00:09:52.150 --> 00:09:55.000
a few days, you can analyze all that data,
not at all.
178
00:09:55.000 --> 00:09:59.390
So you need powerful computers running for
days or even weeks just to
179
00:09:59.390 --> 00:10:02.490
churn through the data and turn it into
something that a person can look at.
180
00:10:02.490 --> 00:10:04.610
And there's many different
questions you can ask about it.
181
00:10:04.610 --> 00:10:07.920
One question that I sort of already
alluded to is, you can ask well,
182
00:10:07.920 --> 00:10:11.220
what are the mutations in this cell
versus other cells from the same person?
183
00:10:11.220 --> 00:10:13.410
So that's say,
a kind of question you could ask.
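As a toy version of that question, the sketch below compares two short sequences position by position and reports where they differ. It assumes the two sequences are already aligned and of equal length, which real data is not: actual pipelines first align millions of short reads and then call variants statistically. The function name and sequences are made up for illustration.

```python
def find_differences(normal: str, tumor: str) -> list[tuple[int, str, str]]:
    """Return (position, normal_base, tumor_base) for every mismatch.

    Positions are 0-based. Both sequences must have the same length.
    """
    if len(normal) != len(tumor):
        raise ValueError("sequences must be the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(normal, tumor))
        if a != b
    ]


# Made-up example: one base differs between the two cells.
print(find_differences("ACGTACGT", "ACGAACGT"))  # [(3, 'T', 'A')]
```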
184
00:10:13.410 --> 00:10:17.363
That requires significant amounts of
computing to take that bewildering mass of
185
00:10:17.363 --> 00:10:20.685
data and turn it into something
comprehensible to a group of scientists
186
00:10:20.685 --> 00:10:21.900
who can then analyze it.
187
00:10:21.900 --> 00:10:25.210
So another thing that's driven this
revolution is not just the efficiency but
188
00:10:25.210 --> 00:10:25.980
the cost.
189
00:10:25.980 --> 00:10:28.140
So at the same time that things have gotten faster,
190
00:10:28.140 --> 00:10:31.325
and more efficient that way,
they've also got much cheaper.
191
00:10:31.325 --> 00:10:36.171
So this plot that you're looking at now
shows you the rough cost per human genome
192
00:10:36.171 --> 00:10:40.673
equivalent going back to around the time
the human genome was completed.
193
00:10:40.673 --> 00:10:44.569
So when the human genome was finished
in 2001, the scientific community then
194
00:10:44.569 --> 00:10:48.463
proceeded with several other important
mammalian genomes that are about the same
195
00:10:48.463 --> 00:10:52.357
size, such as the mouse genome, and the
cow genome, and these are genomes that,
196
00:10:52.357 --> 00:10:55.771
like human, are around two and
a half to three billion base pairs long.
197
00:10:55.771 --> 00:11:00.420
And those projects cost on the order
of $25 or $30 million to sequence.
198
00:11:00.420 --> 00:11:05.210
So that cost started to drop, from that
point on dropped very rapidly, and
199
00:11:05.210 --> 00:11:08.775
then around 2007, there's an introduction
of a new technology from a company called
200
00:11:08.775 --> 00:11:14.120
Solexa, now called Illumina,
that led to even more rapid drops in cost,
201
00:11:14.120 --> 00:11:18.340
because the sequencing technology
itself changed really dramatically and
202
00:11:18.340 --> 00:11:22.140
we'll talk about that a little
bit later in this course.
203
00:11:22.140 --> 00:11:25.020
But as a result,
the sequencing cost today for
204
00:11:25.020 --> 00:11:27.220
a human genome is on the order of $1,000.
205
00:11:27.220 --> 00:11:31.868
So we've gone from $25 to $30
million to $1,000 in the space of
206
00:11:31.868 --> 00:11:33.211
about a dozen years.
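To put that drop in perspective, the sketch below works out the fold change and the average time it took for the cost to halve. It assumes a smooth exponential decline over 12 years, which is a simplification, since the lecture notes that the drop sped up around 2007. The function name is illustrative.

```python
import math


def halving_time_years(start_cost: float, end_cost: float, years: float) -> float:
    """Average time for the cost to halve, assuming a steady exponential decline."""
    fold_change = start_cost / end_cost
    return years / math.log2(fold_change)


END_COST_USD = 1_000
YEARS = 12  # "about a dozen years"

for start_cost in (25e6, 30e6):  # the range quoted in the lecture
    fold = start_cost / END_COST_USD
    t_half = halving_time_years(start_cost, END_COST_USD, YEARS)
    print(f"${start_cost:,.0f} -> ${END_COST_USD:,}: "
          f"{fold:,.0f}-fold drop, halving about every {t_half:.2f} years")
```

On these assumptions the cost fell roughly 25,000- to 30,000-fold, which works out to halving in less than a year on average.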
207
00:11:33.211 --> 00:11:37.343
And that opens up a world of experiments
that we didn't think were feasible before,
208
00:11:37.343 --> 00:11:40.617
not only because of the time involved but
also because of the cost.
209
00:11:40.617 --> 00:11:42.626
So finally, where is all this data?
210
00:11:42.626 --> 00:11:47.229
So there are now trillions of bases of
data that have already been generated.
211
00:11:48.240 --> 00:11:50.870
You and I can go and
download this data and study it ourselves.
212
00:11:50.870 --> 00:11:55.500
Even though this data has been published
and deposited in public archives,
213
00:11:55.500 --> 00:11:58.230
that doesn't mean that there's
nothing more to learn from it.
214
00:11:58.230 --> 00:12:02.190
The convention in the field is that
once you publish a paper describing
215
00:12:02.190 --> 00:12:04.660
some genomic data set,
you're required to release it, and
216
00:12:04.660 --> 00:12:06.710
generally release it with no restrictions.
217
00:12:06.710 --> 00:12:09.900
So there's a terrific set of
repositories of all this data.
218
00:12:09.900 --> 00:12:15.070
The biggest one is the National Center for
Biotechnology Information or NCBI.
219
00:12:15.070 --> 00:12:18.930
The raw data is deposited there in
something called the Sequence Read Archive
220
00:12:18.930 --> 00:12:20.150
or SRA.
221
00:12:20.150 --> 00:12:23.540
But many more databases are contained
within NCBI that contain, for
222
00:12:23.540 --> 00:12:26.120
example, the names and locations of all
223
00:12:26.120 --> 00:12:29.630
the genes that are present in all
the genomes that we've been sequencing.
224
00:12:29.630 --> 00:12:31.730
So this is a great resource for
people who want to go and
225
00:12:31.730 --> 00:12:34.650
try to make new discoveries,
not only about the human genome, but
226
00:12:34.650 --> 00:12:37.860
about the many other thousands of species
that we're engaged in sequencing.
