WEBVTT

1
00:00:00.000 --> 00:00:03.152
This lecture is about why we study
genomics and what it can teach us.

2
00:00:03.152 --> 00:00:08.150
So genomics is the study of
the genomes inside of us.

3
00:00:08.150 --> 00:00:10.600
Let's talk about human genomics.

4
00:00:10.600 --> 00:00:14.490
Everybody on the planet has a genome
that has governed their development and

5
00:00:14.490 --> 00:00:16.770
governs a lot of their biology, and

6
00:00:16.770 --> 00:00:21.030
as you can see by looking at any crowd
of people, we all look really different.

7
00:00:21.030 --> 00:00:24.620
However, we've discovered through
sequencing in recent years

8
00:00:24.620 --> 00:00:28.590
that we're actually 99.9% identical or
even more than that.

9
00:00:28.590 --> 00:00:33.005
So it's really remarkable how
much diversity you can create

10
00:00:33.005 --> 00:00:38.020
from a very small number
of changes in your genome.

11
00:00:38.020 --> 00:00:42.484
But of course, now that we know that we're
99.9% identical, we still want to know,

12
00:00:42.484 --> 00:00:45.055
what is it that's driving
all these differences?

13
00:00:45.055 --> 00:00:47.110
Why is one person tall and
another person short?

14
00:00:47.110 --> 00:00:52.230
Why does one person live to be 100
and another person doesn't?

15
00:00:52.230 --> 00:00:55.109
Why does one person get cancer and
another person not?

16
00:00:55.109 --> 00:00:59.084
Many of these things we suspect
are driven by our genomes, and

17
00:00:59.084 --> 00:01:00.925
we want to understand that.

18
00:01:00.925 --> 00:01:05.853
So, another thing, one of the most
basic things that our genome determines

19
00:01:05.853 --> 00:01:08.060
is how our bodies develop.

20
00:01:08.060 --> 00:01:11.709
We start off, as you all know, we start
off as a single cell which divides into

21
00:01:11.709 --> 00:01:15.362
a few apparently identical cells, and
that quickly develops into an embryo,

22
00:01:15.362 --> 00:01:17.408
and eventually grows into a whole person.

23
00:01:17.408 --> 00:01:20.904
And somehow that entire program of
development is encoded in our genome, and

24
00:01:20.904 --> 00:01:23.178
this is something that we don't yet
understand.

25
00:01:23.178 --> 00:01:24.015
In addition,

26
00:01:24.015 --> 00:01:28.620
the code in our cells determines all
the different cell types and for example,

27
00:01:28.620 --> 00:01:33.364
it determines how to make a neuron, which
is a very complicated cell, obviously

28
00:01:33.364 --> 00:01:38.210
a very different kind of cell from say a
skin cell, it does very different things.

29
00:01:38.210 --> 00:01:41.823
And yet the genome inside of a neuron
in your body is identical to the genome

30
00:01:41.823 --> 00:01:43.468
inside of any of your skin cells.

30
00:01:43.468 --> 00:01:47.296
So we want to understand what's going on
in that cell even though it has the same

32
00:01:47.296 --> 00:01:51.240
program, the same code, somehow it's
executing a different program to make it

33
00:01:51.240 --> 00:01:52.877
into a neuron versus a skin cell.

34
00:01:52.877 --> 00:01:56.750
Another big area of research
in genomics is cancer.

35
00:01:56.750 --> 00:02:01.510
So cancer is essentially
a genetic disease, we know now.

36
00:02:01.510 --> 00:02:05.220
Cancer cells are simply, again, cells in
your body that have the same genetic code,

37
00:02:05.220 --> 00:02:08.140
the same genome in them, but
somehow they've gone haywire,

38
00:02:08.140 --> 00:02:10.840
and they've started
replicating without control.

39
00:02:10.840 --> 00:02:12.560
That's what makes something cancerous.

40
00:02:12.560 --> 00:02:15.970
Basically it's cells that are dividing
without any check on their division.

41
00:02:15.970 --> 00:02:19.640
And, in fact, we define cancers by
the type of cell that started the cancer.

42
00:02:19.640 --> 00:02:23.700
So there's skin cancer, where a skin
cell starts dividing without control.

43
00:02:23.700 --> 00:02:25.370
One form of it is called melanoma.

44
00:02:25.370 --> 00:02:26.500
There's lung cancer.

45
00:02:26.500 --> 00:02:28.500
There's blood cancers
that are called leukemia.

46
00:02:28.500 --> 00:02:32.459
These are all defined by the cells
that started the cancer out and

47
00:02:32.459 --> 00:02:35.169
they all have a common phenotype, that is,

48
00:02:35.169 --> 00:02:39.658
they all have a common feature that
they're dividing without control.

49
00:02:39.658 --> 00:02:44.604
But the consequences of different cancers
are very different, and in fact,

50
00:02:44.604 --> 00:02:49.475
the mutations in our DNA that cause
these cells to become cancerous are also

51
00:02:49.475 --> 00:02:50.340
different.

52
00:02:52.530 --> 00:02:54.982
So what do our genes have
to do with any of this?

53
00:02:54.982 --> 00:02:56.040
So what I'm talking about,

54
00:02:56.040 --> 00:03:00.670
I just mentioned the word mutation,
a mutation is a change in your genome.

55
00:03:00.670 --> 00:03:04.780
And that can happen because
your DNA is damaged,

56
00:03:04.780 --> 00:03:07.060
it can happen because of
an accident in replication.

57
00:03:07.060 --> 00:03:10.055
So every time your cells divide,
to explain that latter point,

58
00:03:10.055 --> 00:03:13.670
every time your cells divide,
the entire genome has to be copied.

59
00:03:13.670 --> 00:03:16.190
And our cells are really,
really good at this, fortunately,

60
00:03:16.190 --> 00:03:17.720
otherwise we wouldn't exist.

61
00:03:17.720 --> 00:03:21.740
We wouldn't survive for very long, but
once in a while, they make an error,

62
00:03:21.740 --> 00:03:25.985
probably only one to three
errors per cell division.

63
00:03:25.985 --> 00:03:29.520
And once in a while, that error
causes something bad to happen, and

64
00:03:29.520 --> 00:03:33.300
we believe a lot of cancers are caused
by these sort of accidental errors.

65
00:03:33.300 --> 00:03:37.990
And understanding that is a matter
of understanding, well okay,

66
00:03:37.990 --> 00:03:42.390
my cell makes an error,
what does it mean for a mutation or

67
00:03:42.390 --> 00:03:46.160
an error in replication
to turn a cell cancerous?

68
00:03:46.160 --> 00:03:50.330
What usually we think happens is
that that mutation affects a gene

69
00:03:50.330 --> 00:03:53.500
which now doesn't function properly and
that gene, for example,

70
00:03:53.500 --> 00:03:56.950
that might be a gene that
controls cell division, and

71
00:03:56.950 --> 00:03:59.610
now you've sort of turned off
the check on cell division.

72
00:03:59.610 --> 00:04:02.050
And now the cell starts replicating
without control and you have a cancer.

73
00:04:02.050 --> 00:04:04.720
So that's the kind of thing we're
looking at when we're using genomics to

74
00:04:04.720 --> 00:04:05.570
study cancer.

75
00:04:06.690 --> 00:04:07.950
So how does this all work?

76
00:04:07.950 --> 00:04:10.790
So this program that I'm talking
about that's encoded in our DNA.

77
00:04:10.790 --> 00:04:14.420
Well there's something
called the central dogma.

78
00:04:14.420 --> 00:04:18.660
I didn't make that word up, that phrase
was created by Francis Crick,

79
00:04:18.660 --> 00:04:21.990
one of the co-discoverers of the structure
of DNA over fifty years ago.

80
00:04:23.070 --> 00:04:26.500
And it's now still used,
even though, as with many dogmas,

81
00:04:26.500 --> 00:04:27.770
it's not an absolute dogma.

82
00:04:27.770 --> 00:04:31.760
But the central dogma of biology,
or molecular biology,

83
00:04:31.760 --> 00:04:36.485
says that information flows in
a single direction from your genome,

84
00:04:36.485 --> 00:04:39.262
that is your DNA, to RNA, to proteins.

85
00:04:39.262 --> 00:04:43.789
And the processes that govern
that we give different names.

86
00:04:43.789 --> 00:04:48.294
So the copying, when DNA is turned into
genes, the first step is you take pieces

87
00:04:48.294 --> 00:04:52.665
of it called exons, and you transcribe
them, that's the copying process,

88
00:04:52.665 --> 00:04:57.036
into RNA, and RNA is essentially an exact
copy of the DNA where all the letters

89
00:04:57.036 --> 00:05:00.399
are the same, with the only
difference being that the letter T, or

90
00:05:00.399 --> 00:05:03.350
thymine, becomes the letter U,
which is uracil.

91
00:05:03.350 --> 00:05:05.980
But otherwise it's
molecularly the same thing.

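NOTE
A minimal sketch of the transcription step just described, treating DNA
and RNA as plain strings. The function name transcribe is hypothetical,
and the sketch assumes the input is the coding-strand sequence, so the
only change is T becoming U.
```python
def transcribe(dna: str) -> str:
    """Copy a DNA coding-strand sequence into RNA by replacing T with U."""
    dna = dna.upper()
    # Reject anything that is not one of the four DNA letters.
    if set(dna) - set("ACGT"):
        raise ValueError("expected only the letters A, C, G and T")
    return dna.replace("T", "U")
print(transcribe("ATGGCTTAA"))
```
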
92
00:05:05.980 --> 00:05:08.110
That RNA then has to be
turned into a protein.

93
00:05:08.110 --> 00:05:11.970
Now, proteins are not comprised of
these four letters of nucleic acids.

94
00:05:11.970 --> 00:05:16.380
They're made up of 20 letters that
are the abbreviations for

95
00:05:16.380 --> 00:05:21.050
amino acids and proteins are also long
molecules, not nearly as long as DNA.

96
00:05:21.050 --> 00:05:23.960
A typical protein might be 300 or
400 amino acids long,

97
00:05:23.960 --> 00:05:26.940
and the way you get a protein
is you take a piece of RNA and

98
00:05:26.940 --> 00:05:31.980
you read it three letters at a time,
and each triplet encodes an amino acid.

99
00:05:31.980 --> 00:05:36.880
And if you think about it for a second
there's four possible RNA nucleotides.

100
00:05:36.880 --> 00:05:41.070
So there's four to the third,
or 64 possible combinations.

101
00:05:41.070 --> 00:05:45.480
Each of those 64 triplets gets

102
00:05:45.480 --> 00:05:50.200
translated either into an amino acid or not.

103
00:05:50.200 --> 00:05:52.220
There's three special
ones called stop codons.

104
00:05:52.220 --> 00:05:53.500
They indicate the end of a protein.

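NOTE
A rough sketch of the triplet reading described above. The names BASES,
AMINO_ACIDS, CODON_TABLE and translate are hypothetical. The 64-letter
string is the standard genetic code written in UCAG order, which is an
assumption brought in for illustration and is not spelled out in the
lecture. It is one way to check that four letters give 64 triplets and
that three of them are stop codons.
```python
from itertools import product
BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# product() yields UUU, UUC, UUA, UUG, UCU, ... in the same order as the string.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
def translate(rna: str) -> str:
    """Read RNA three letters at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
print(len(CODON_TABLE))
print(translate("AUGGCUUAA"))
```
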
105
00:05:53.500 --> 00:05:58.140
So that's basically how DNA goes and
becomes a protein.

106
00:05:58.140 --> 00:06:00.540
And the proteins kind of do
all the work of your cells.

107
00:06:00.540 --> 00:06:04.650
So the proteins in your body
are what are actually doing most

108
00:06:04.650 --> 00:06:06.900
of the functional work of say,
metabolizing things,

109
00:06:06.900 --> 00:06:10.160
digesting your food,
moving things around in the cells.

110
00:06:10.160 --> 00:06:15.270
So that fundamental dogma has been around
for many decades now, and it more or less

111
00:06:15.270 --> 00:06:20.960
describes how information flows most of
the time from your genome to proteins.

112
00:06:20.960 --> 00:06:22.940
However, that's not the whole picture,
we now know.

113
00:06:22.940 --> 00:06:27.690
So over time, we've learned that
information can flow the other way,

114
00:06:27.690 --> 00:06:31.280
and as scientists got more
familiar with the whole model,

115
00:06:31.280 --> 00:06:33.490
they realized that it had
to flow the other way.

116
00:06:33.490 --> 00:06:36.530
As I was saying a little earlier in this
lecture, there are many different cell

117
00:06:36.530 --> 00:06:39.890
types in your body,
every cell has the same exact DNA.

118
00:06:39.890 --> 00:06:42.870
So if everything just flowed
from the DNA to the proteins,

119
00:06:42.870 --> 00:06:46.500
it would seem sort of fundamentally
impossible for the cells to

120
00:06:46.500 --> 00:06:50.120
behave differently, yet we know that
neurons don't act like skin cells.

121
00:06:50.120 --> 00:06:50.825
So what's going on?

122
00:06:50.825 --> 00:06:54.800
So the proteins themselves, some of
the proteins that are created by the DNA

123
00:06:54.800 --> 00:06:57.880
go back and bind to that DNA stuff and
modify it and

124
00:06:57.880 --> 00:07:00.250
change the genes that get turned on and
off.

125
00:07:00.250 --> 00:07:02.158
So proteins can self-regulate in this way.

126
00:07:02.158 --> 00:07:05.820
And there are other things that can
happen with DNA, other modifiers,

127
00:07:05.820 --> 00:07:09.470
some are called methylation marks
that can change DNA as well.

128
00:07:09.470 --> 00:07:13.214
So there are features on the DNA that
are affected by the proteins themselves.

129
00:07:13.214 --> 00:07:17.692
So there are feedback loops in this
sort of information flow, and

130
00:07:17.692 --> 00:07:21.620
that as a result,
information's actually flowing backwards.

131
00:07:21.620 --> 00:07:22.861
So in the genomics field, so

132
00:07:22.861 --> 00:07:25.463
how do we make these measurements
that I'm talking about?

133
00:07:25.463 --> 00:07:29.277
If we want to
understand cancer, then we have to go and

134
00:07:29.277 --> 00:07:33.233
get some cancer cells and figure out
what mutations happen in the cells.

135
00:07:33.233 --> 00:07:34.138
So how do we do that?

136
00:07:34.138 --> 00:07:35.800
We do that with sequencing.

137
00:07:35.800 --> 00:07:39.040
So sequencing is sort of at
the heart of genomics, and

138
00:07:39.040 --> 00:07:42.620
the genomics revolution that we've been
in for about the past 20 years, and

139
00:07:42.620 --> 00:07:46.150
this has really accelerated
over the past ten years.

140
00:07:46.150 --> 00:07:50.220
And one reason for this acceleration
is that genome technology has gotten

141
00:07:50.220 --> 00:07:52.500
incredibly fast and efficient.

142
00:07:52.500 --> 00:07:55.420
So what you're looking at here are some
of the latest sequencing machines.

143
00:07:55.420 --> 00:07:58.800
A sequencer today, the highest-throughput
sequencer we have today, can sequence in

144
00:07:58.800 --> 00:08:04.120
a single run of the machine,
as many as a trillion nucleotides of DNA.

145
00:08:04.120 --> 00:08:08.240
So to give you a sense of what that means,
the Human Genome Project was started

146
00:08:08.240 --> 00:08:13.170
in 1989 with the goal of sequencing
one human genome in 15 years.

147
00:08:13.170 --> 00:08:16.430
It beat that goal, we actually
published the human genome in 2001, so

148
00:08:16.430 --> 00:08:19.290
in just 12 years we finished the project.

149
00:08:19.290 --> 00:08:21.810
I was part of that project.

150
00:08:21.810 --> 00:08:24.780
And it was a massive effort
involving thousands of scientists

151
00:08:24.780 --> 00:08:25.990
from around the world.

152
00:08:25.990 --> 00:08:27.480
And sequencers were employed at

153
00:08:28.620 --> 00:08:31.730
half a dozen huge genome
sequencing centers in the US, and

154
00:08:31.730 --> 00:08:36.590
large sequencing centers in the UK,
in France, in China, all over the world.

155
00:08:36.590 --> 00:08:40.210
Today you can get
a sequencer in a single lab,

156
00:08:40.210 --> 00:08:44.890
one of these machines run by a single
investigator, and in just a few days,

157
00:08:44.890 --> 00:08:49.150
you can sequence on the order of several
hundred human genome equivalents.

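NOTE
A quick back-of-the-envelope check of the "several hundred human genome
equivalents" figure, using only numbers quoted in this lecture. The
constant names are illustrative, and the genome size is taken as the
upper end of the "two and a half to three billion" range given later.
```python
# Figures quoted in the lecture; the names are illustrative.
BASES_PER_RUN = 1_000_000_000_000   # "as many as a trillion nucleotides"
HUMAN_GENOME_BASES = 3_000_000_000  # roughly three billion base pairs
# How many human-genome-sized stretches of DNA fit into one run.
genome_equivalents = BASES_PER_RUN / HUMAN_GENOME_BASES
print(round(genome_equivalents))
```
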
158
00:08:49.150 --> 00:08:53.360
So now we're in maybe a little more than
a dozen years after the completion of

159
00:08:53.360 --> 00:08:54.135
the human genome.

160
00:08:54.135 --> 00:08:57.140
That was a 12-year project involving
thousands of scientists.

161
00:08:57.140 --> 00:09:00.330
Now a single scientist in one day can
do far more sequencing than that entire

162
00:09:00.330 --> 00:09:01.630
consortium did.

163
00:09:01.630 --> 00:09:04.947
So that's allowed us to start looking
at things like cancer genomics.

164
00:09:04.947 --> 00:09:09.369
When the human genome was published in
2001, no one at that time thought it was

165
00:09:09.369 --> 00:09:13.922
even remotely feasible to start sequencing
the entire genome of a single tumor, and

166
00:09:13.922 --> 00:09:17.449
yet today, we have literally tens
of thousands of projects going

167
00:09:17.449 --> 00:09:19.659
on around the world doing exactly that.

168
00:09:19.659 --> 00:09:24.148
So the result of that is that we
are generating these enormous,

169
00:09:24.148 --> 00:09:26.400
enormous data sets.

170
00:09:26.400 --> 00:09:30.470
So sure we can sequence all that data,
but what I didn't say was that

171
00:09:30.470 --> 00:09:33.200
towards the end of the Human Genome
Project, when we were at the point where

172
00:09:33.200 --> 00:09:36.280
we were writing the paper, and I was part
of one of the teams that was doing that,

173
00:09:36.280 --> 00:09:39.850
we had hundreds of scientists frantically
trying to analyze all this data from

174
00:09:39.850 --> 00:09:43.620
a single genome and figure out what we
could say about it in a scientific paper.

175
00:09:43.620 --> 00:09:48.250
So today, one investigator, one lab,
can generate multiple genomes

176
00:09:48.250 --> 00:09:52.150
in the space of a week, but that doesn't
mean that in the space of a week, or

177
00:09:52.150 --> 00:09:55.000
a few days, you can analyze all that data,
not at all.

178
00:09:55.000 --> 00:09:59.390
So you need powerful computers running for
days or even weeks just to

179
00:09:59.390 --> 00:10:02.490
churn through the data and turn it into
something that a person can look at.

180
00:10:02.490 --> 00:10:04.610
And there's many different
questions you can ask about it.

181
00:10:04.610 --> 00:10:07.920
One question that I sort of already
alluded to is, you can ask well,

182
00:10:07.920 --> 00:10:11.220
what are the mutations in this cell
versus other cells from the same person?

183
00:10:11.220 --> 00:10:13.410
So that's say,
a kind of question you could ask.

184
00:10:13.410 --> 00:10:17.363
That requires significant amounts of
computing to take that bewildering mass of

185
00:10:17.363 --> 00:10:20.685
data and turn it into something
comprehensible to a group of scientists

186
00:10:20.685 --> 00:10:21.900
who can then analyze it.

187
00:10:21.900 --> 00:10:25.210
So another thing that's driven this
revolution is not just the efficiency but

188
00:10:25.210 --> 00:10:25.980
the cost.

189
00:10:25.980 --> 00:10:28.140
So at the same time that things have gotten faster,

190
00:10:28.140 --> 00:10:31.325
and more efficient that way,
they've also gotten much cheaper.

191
00:10:31.325 --> 00:10:36.171
So this plot that you're looking at now
shows you the rough cost per human genome

192
00:10:36.171 --> 00:10:40.673
equivalent going back to around the time
the human genome was completed.

193
00:10:40.673 --> 00:10:44.569
So when the human genome was finished
in 2001, the scientific community then

194
00:10:44.569 --> 00:10:48.463
proceeded with several other important
mammalian genomes that are about the same

195
00:10:48.463 --> 00:10:52.357
size, such as the mouse genome, and the
cow genome, and these are genomes that,

196
00:10:52.357 --> 00:10:55.771
like human, are around two and
a half to three billion base pairs long.

197
00:10:55.771 --> 00:11:00.420
And those projects cost on the order
of $25 or $30 million to sequence.

198
00:11:00.420 --> 00:11:05.210
So that cost started to drop, from that
point on dropped very rapidly, and

199
00:11:05.210 --> 00:11:08.775
then around 2007, there was the introduction
of a new technology from a company called

200
00:11:08.775 --> 00:11:14.120
Solexa, now called Illumina,
that led to even more rapid drops in cost,

201
00:11:14.120 --> 00:11:18.340
because the sequencing technology
itself changed really dramatically and

202
00:11:18.340 --> 00:11:22.140
we'll talk about that a little
bit later in this course.

203
00:11:22.140 --> 00:11:25.020
But as a result,
the sequencing cost today for

204
00:11:25.020 --> 00:11:27.220
a human genome is on the order of $1,000.

205
00:11:27.220 --> 00:11:31.868
So we've gone from $25 to $30
million to $1,000 in the space of

206
00:11:31.868 --> 00:11:33.211
about a dozen years.

207
00:11:33.211 --> 00:11:37.343
And that opens up a world of experiments
that we didn't think were feasible before,

208
00:11:37.343 --> 00:11:40.617
not only because of the time involved but
also because of the cost.

209
00:11:40.617 --> 00:11:42.626
So finally, where is all this data?

210
00:11:42.626 --> 00:11:47.229
So there are now trillions of bases of
data that have already been generated.

211
00:11:48.240 --> 00:11:50.870
You and I can go and
download this data and study it ourselves.

212
00:11:50.870 --> 00:11:55.500
Even though this data has been published
and deposited in public archives,

213
00:11:55.500 --> 00:11:58.230
that doesn't mean that there's
nothing more to learn from it.

214
00:11:58.230 --> 00:12:02.190
The convention in the field is that
once you publish a paper describing

215
00:12:02.190 --> 00:12:04.660
some genomic data set,
you're required to release it, and

216
00:12:04.660 --> 00:12:06.710
generally release it with no restrictions.

217
00:12:06.710 --> 00:12:09.900
So there's a terrific set of
repositories of all this data.

218
00:12:09.900 --> 00:12:15.070
The biggest one is the National Center for
Biotechnology Information or NCBI.

219
00:12:15.070 --> 00:12:18.930
The raw data is deposited there in
something called the Sequence Read Archive

220
00:12:18.930 --> 00:12:20.150
or SRA.

221
00:12:20.150 --> 00:12:23.540
But many more databases are contained
within NCBI that contain, for

222
00:12:23.540 --> 00:12:26.120
example, the names and locations of all

223
00:12:26.120 --> 00:12:29.630
the genes that are present in all
the genomes that we've been sequencing.

224
00:12:29.630 --> 00:12:31.730
So this is a great resource for
people who want to go and

225
00:12:31.730 --> 00:12:34.650
try to make new discoveries,
not only about the human genome, but

226
00:12:34.650 --> 00:12:37.860
about the many other thousands of species
that we're engaged in sequencing.
