
Differential expression analysis

for sequencing count data


Simon Anders
RNA-Seq
Two applications of RNA-Seq
Discovery
find new transcripts
find transcript boundaries
find splice junctions
Comparison
Given samples from different experimental conditions, find effects of the treatment on
gene expression strengths
isoform abundance ratios, splice patterns, transcript boundaries
Alignment
Should one align against the genome or the transcriptome?
against transcriptome
easier, because no gapped alignment is necessary
but:
risk of missing possible alignments!
Count data in HTS
RNA-Seq
Tag-Seq
Gene GliNS1 G144 G166 G179 CB541 CB660
13CDNA73 4 0 6 1 0 5
A2BP1 19 18 20 7 1 8
A2M 2724 2209 13 49 193 548
A4GALT 0 0 48 0 0 0
AAAS 57 29 224 49 202 92
AACS 1904 1294 5073 5365 3737 3511
AADACL1 3 13 239 683 158 40
[...]
ChIP-Seq
Bar-Seq
...
Counting rules
Count reads, not base-pairs
Count each read at most once.
Discard a read if
it cannot be uniquely mapped
its alignment overlaps with several genes
the alignment quality score is bad
(for paired-end reads) the mates do not map to the same gene
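The counting rules above can be sketched as a small decision function. This is only an illustration: the `Alignment` record, its field names and the quality cut-off `MIN_QUALITY` are hypothetical and do not correspond to the API of any particular library.

```python
from collections import Counter
from typing import FrozenSet, NamedTuple, Optional


class Alignment(NamedTuple):
    """Hypothetical, simplified alignment record (not a real library's API)."""
    read_id: str
    n_hits: int                                   # number of places the read aligns to
    genes: FrozenSet[str]                         # genes overlapped by the alignment
    quality: int                                  # alignment quality score
    mate_genes: Optional[FrozenSet[str]] = None   # genes hit by the mate (paired-end only)


MIN_QUALITY = 10  # assumed cut-off, for illustration only


def gene_for_read(aln: Alignment) -> Optional[str]:
    """Return the gene the read counts for, or None if the read is discarded."""
    if aln.n_hits != 1:                  # cannot be uniquely mapped
        return None
    if len(aln.genes) != 1:              # overlaps several genes (or none)
        return None
    if aln.quality < MIN_QUALITY:        # alignment quality score is bad
        return None
    if aln.mate_genes is not None and aln.mate_genes != aln.genes:
        return None                      # mates do not map to the same gene
    return next(iter(aln.genes))


def count_reads(alignments) -> Counter:
    """Count reads (not base pairs), each read at most once."""
    counts: Counter = Counter()
    seen = set()
    for aln in alignments:
        if aln.read_id in seen:          # never count a read twice
            continue
        seen.add(aln.read_id)
        gene = gene_for_read(aln)
        if gene is not None:
            counts[gene] += 1
    return counts
```

Called on a list of such records, `count_reads` returns one integer per gene, which is the kind of count table shown above.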
Normalisation for library size
If sample A has been sampled deeper than sample B, we expect counts to be higher.
Simply using the total number of reads per sample is not a good idea; genes that are strongly and differentially expressed may distort the ratio of total reads.
By dividing, for each gene, the count from sample A by the count for sample B, we get one estimate per gene for the size ratio of sample A to sample B.
We use the median of all these ratios.
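A minimal sketch of the median-of-ratios idea for two samples, using the GliNS1 and G144 columns of the Tag-Seq table shown earlier. Skipping genes with a zero count in either sample is an assumption made here so that every ratio is defined; the function names are illustrative.

```python
from statistics import median


def size_ratio(counts_a, counts_b):
    """Median over genes of count_A / count_B.

    Genes with a zero count in either sample are skipped (an assumption
    of this sketch, so that every ratio is defined and non-zero).
    """
    ratios = [a / b for a, b in zip(counts_a, counts_b) if a > 0 and b > 0]
    return median(ratios)


def total_ratio(counts_a, counts_b):
    """Naive alternative: ratio of the total read counts."""
    return sum(counts_a) / sum(counts_b)


# Columns GliNS1 and G144 of the example count table
glins1 = [4, 19, 2724, 0, 57, 1904, 3]
g144 = [0, 18, 2209, 0, 29, 1294, 13]

print("median of ratios:", size_ratio(glins1, g144))
print("ratio of totals: ", total_ratio(glins1, g144))
```

The median is not pulled around by a handful of strongly and differentially expressed genes, which is the reason given above for preferring it to the ratio of totals.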
Effect size and significance
Sample-to-sample variation
comparison of
two replicates
comparison of
treatment vs control
The Poisson distribution
This bag contains very many small balls, 10% of which are red.
Several experimenters are tasked with determining the percentage of red balls.
Each of them is permitted to draw 20 balls out of the bag, without looking.
Results with 20 balls drawn: 3/20 = 15%, 1/20 = 5%, 2/20 = 10%, 0/20 = 0%
Results with 100 balls drawn: 7/100 = 7%, 10/100 = 10%, 8/100 = 8%, 11/100 = 11%
The Poisson distribution turns up whenever things are counted.
Example: A short, light rain shower with $r$ drops/m$^2$.
What is the probability of finding $k$ drops on a paving stone of size 1 m$^2$?
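For the rain-shower example, the answer is given by the Poisson probability mass function, $P(k) = e^{-\lambda} \lambda^k / k!$ with $\lambda = r \cdot \text{area}$. The sketch below evaluates it; the rain density `r = 3.0` is an arbitrary value chosen for illustration.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of observing exactly k events when `mean` are expected."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


r = 3.0      # assumed rain density in drops per square metre (illustrative)
area = 1.0   # paving stone of size 1 square metre

for k in range(8):
    print(f"P({k} drops) = {poisson_pmf(k, r * area):.3f}")
```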
Poisson distribution
If $p$ is the proportion of red balls in the bag, and we draw $n$ balls, we expect $\mu = pn$ balls to be red.
The actual number $k$ of red balls follows a Poisson distribution, and hence $k$ varies around its expectation value $\mu$ with standard deviation $\sqrt{\mu}$.
Our estimate of the proportion, $\hat{p} = k/n$, hence has the expected value $\mu/n = p$ and the standard error $\Delta p = \sqrt{\mu}/n = p/\sqrt{\mu}$. The relative error is $\Delta p / p = 1/\sqrt{\mu}$.
balls drawn    expected number of red balls    relative error of estimate
20             2                               $1/\sqrt{2}$ = 71%
100            10                              $1/\sqrt{10}$ = 32%
Poisson distribution: Counting uncertainty
expected number    standard deviation of      relative error in estimate
of red balls       number of red balls        for fraction of red balls
10                 $\sqrt{10}$ = 3.2          $1/\sqrt{10}$ = 31.6%
100                $\sqrt{100}$ = 10.0        $1/\sqrt{100}$ = 10.0%
1,000              $\sqrt{1000}$ = 31.6       $1/\sqrt{1000}$ = 3.2%
10,000             $\sqrt{10000}$ = 100.0     $1/\sqrt{10000}$ = 1.0%
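The table above follows directly from the square-root rule for Poisson counts. A short sketch that recomputes it (the function name is illustrative):

```python
import math


def counting_uncertainty(expected_count: float):
    """Return (standard deviation, relative error) for a Poisson count."""
    sd = math.sqrt(expected_count)
    return sd, sd / expected_count   # relative error equals 1 / sqrt(expected_count)


for mu in (10, 100, 1000, 10000):
    sd, rel = counting_uncertainty(mu)
    print(f"{mu:>6}  sd = {sd:6.1f}  relative error = {rel:6.1%}")
```

Each tenfold increase in the expected count reduces the relative error only by a factor of about 3.2.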

For Poisson-distributed data, the variance is equal to the mean.
Hence, no need to estimate the variance
according to several authors: Marioni et al. (2008), Wang et al. (2010), Bloom et al. (2009), Kasowski et al. (2010), Bullard et al. (2010)
Really?
Is HTS count data Poisson-distributed?
To sort this out, we have to distinguish two sources of noise.
Shot noise
Consider this situation:
Several flow cell lanes are filled with aliquots of the same prepared library.
The concentration of a certain transcript species is exactly the same in each lane.
We get the same total number of reads from each lane.
For each lane, count how often you see a read from the transcript. Will the counts all be the same?
Of course not. Even for equal concentration, the counts will vary. This theoretically unavoidable noise is called shot noise.
Shot noise
Shot noise: The variance in counts that persists even if everything is exactly equal. (Same as the evenly falling rain on the paving stones.)
Stochastics tells us that shot noise follows a Poisson distribution.
The standard deviation of shot noise can be calculated: it is equal to the square root of the average count.
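The square-root rule can be checked with a small simulation of many lanes loaded from the same library. This is a sketch under stated assumptions: the expected count of 50 reads per lane and the number of lanes are arbitrary, and the Poisson sampler is Knuth's simple multiplication method, written out here because it needs nothing beyond the standard library.

```python
import math
import random
import statistics


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw one Poisson variate (Knuth's method; adequate for small means)."""
    threshold = math.exp(-mean)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


rng = random.Random(42)
expected_count = 50.0   # assumed expected reads per lane for one transcript
lane_counts = [sample_poisson(expected_count, rng) for _ in range(5000)]

print("average count:        ", statistics.mean(lane_counts))
print("observed std. dev.:   ", statistics.pstdev(lane_counts))
print("sqrt of average count:", math.sqrt(statistics.mean(lane_counts)))
```

The observed standard deviation and the square root of the average count should agree closely, even though nothing but chance differs between the lanes.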
Sample noise
Now consider:
Several lanes contain samples from biological replicates.
The concentration of a given transcript varies around a mean value with a certain standard deviation.
This standard deviation cannot be calculated; it has to be estimated from the data.
Differential expression: Two questions
Assume you use RNA-Seq to determine the concentration of transcripts from some gene in different samples. What is your question?
1. "Is the concentration in one sample different from the concentration in another sample?"
or
2. "Can the difference in concentration between treated samples and control samples be attributed to the treatment?"
"Can the difference in concentration between treated samples and control samples be attributed to the treatment?"
Look at the differences between replicates: they show how much variation occurs without any difference in treatment.
Could it be that the treatment has no effect and the difference between treatment and control is just a fluctuation of the same kind as between replicates?
To answer this, we need to assess the strength of this sample noise.
Summary: Noise
We distinguish:
Shot noise
unavoidable, appears even with perfect replication
dominant noise for weakly expressed genes
Technical noise
from sample preparation and sequencing
negligible (if all goes well)
Biological noise
unaccounted-for differences between samples
dominant noise for strongly expressed genes
Shot noise can be computed; technical and biological noise need to be estimated from the data.
Replicates
Two replicates permit a global estimate of variation.
Sufficiently many replicates make it possible to
estimate variation for each gene
randomize out unknown covariates
spot outliers
improve precision of expression and fold-change estimates
Replication at what level?
Replicates should differ in all aspects in which control and treatment samples differ, except for the actual treatment.
Estimating noise from the data
If we have many replicates, we can estimate the variance for each gene.
With only few replicates, we need an additional assumption. We use: "Genes with similar expression strength have similar variance."
Variance calculated from comparing two replicates
Poisson: $v = \mu$
Poisson + constant CV: $v = \mu + \alpha \mu^2$
Poisson + local regression: $v = \mu + f(\mu^2)$
Variance depends strongly on the mean
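The three variance models can be written as plain functions of the mean, which makes it easy to see how quickly they diverge for strongly expressed genes. In this sketch, `alpha = 0.05` and the smooth function `f_example` are hypothetical stand-ins; in practice both would be fitted to the replicate data.

```python
def variance_poisson(mu: float) -> float:
    """Poisson: variance equals the mean."""
    return mu


def variance_constant_cv(mu: float, alpha: float) -> float:
    """Poisson plus a constant squared coefficient of variation alpha."""
    return mu + alpha * mu ** 2


def variance_local(mu: float, f) -> float:
    """Poisson plus a smooth function f of mu^2 (fitted by local regression)."""
    return mu + f(mu ** 2)


alpha = 0.05  # hypothetical value, for illustration only


def f_example(mu_squared: float) -> float:
    """Hypothetical stand-in for a fitted curve; not a real fit."""
    return 0.05 * mu_squared ** 0.9


for mu in (10, 100, 1000, 10000):
    print(mu,
          variance_poisson(mu),
          round(variance_constant_cv(mu, alpha), 1),
          round(variance_local(mu, f_example), 1))
```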
Technical and biological replicates
Nagalakshmi et al. (2008) have found that
counts for the same gene from different technical replicates have a variance equal to the mean (Poisson).
counts for the same gene from different biological replicates have a variance exceeding the mean (overdispersion).
Marioni et al. (2008) have confirmed the first fact (and confused everybody by ignoring the second fact).
Technical and biological replicates
RNA-Seq of yeast [Nagalakshmi et al., 2008]
Plot legend: biological replicates, technical replicates, Poisson noise
The negative-binomial distribution
A commonly used generalization of the Poisson distribution with two parameters
The NB distribution from a hierarchical model
Biological sample with mean $\mu$ and variance $v$
Poisson distribution with mean $q$ and variance $q$.
Negative binomial with mean $\mu$ and variance $q + v$.
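The hierarchical construction can be illustrated by simulation: draw a concentration for each biological sample, then draw a Poisson count given that concentration. The sketch below assumes a gamma distribution for the biological variation (the choice that yields a negative binomial exactly); the values `mu = 50` and `v = 100` are arbitrary, and the Poisson sampler is Knuth's simple method.

```python
import math
import random
import statistics


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw one Poisson variate (Knuth's method; adequate for small means)."""
    threshold = math.exp(-mean)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_hierarchical(mu: float, v: float, rng: random.Random) -> int:
    """Biological level: q ~ Gamma with mean mu and variance v (assumed).
    Counting level: count ~ Poisson(q)."""
    shape = mu * mu / v
    scale = v / mu
    q = rng.gammavariate(shape, scale)
    return sample_poisson(q, rng)


rng = random.Random(0)
mu, v = 50.0, 100.0   # illustrative values
counts = [sample_hierarchical(mu, v, rng) for _ in range(10000)]

print("mean of counts:    ", statistics.mean(counts))
print("variance of counts:", statistics.pvariance(counts))
print("Poisson part + v:  ", mu + v)
```

The simulated variance should come out close to the Poisson contribution (which averages to the mean) plus the biological variance $v$, well above what a pure Poisson model would predict.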
Testing: Null hypothesis
Model:
The counts for a given gene in sample $j$ come from negative binomial distributions with mean $s_j \mu_\rho$ and variance $s_j \mu_\rho + s_j^2 v(\mu_\rho)$.
Null hypothesis:
The experimental condition $\rho$ has no influence on the expression of the gene under consideration: $\mu_{\rho_1} = \mu_{\rho_2}$
$s_j$: relative size of library $j$
$\mu_\rho$: mean value for condition $\rho$
$v(\mu_\rho)$: fitted variance for mean $\mu_\rho$
Model fitting
Estimate the variance from replicates
Fit a line to get the variance-mean dependence $v(\mu)$
(local regression for a gamma-family generalized linear model; extra math needed to handle differing library sizes)
Testing for differential expression
For each of two conditions, add the counts from all replicates, and consider these sums $K_{iA}$ and $K_{iB}$ as NB-distributed with moments as estimated and fitted.
Then, we calculate the probability of observing the actual sums or more extreme ones, conditioned on the sum being $k_{iA} + k_{iB}$, to get a p value.
(similar to the test used in Robinson and Smyth's edgeR)
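A minimal sketch of such a conditional test for a single gene. It assumes the two sums are independent negative binomials whose means and variances under the null hypothesis are already known, and it reads "more extreme" as "no more probable than the observed split". The function names and the example moments are illustrative; this is not the DESeq implementation.

```python
import math


def nb_logpmf(k: int, mean: float, var: float) -> float:
    """Log-probability of count k for a negative binomial given by its
    mean and variance (requires var > mean)."""
    p = mean / var
    r = mean * mean / (var - mean)
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(p) + k * math.log1p(-p))


def nb_conditional_pvalue(k_a, k_b, mean_a, var_a, mean_b, var_b):
    """P-value conditioned on the total k_a + k_b: the summed probability of
    all splits (a, total - a) that are no more likely than the observed one,
    divided by the summed probability of all splits."""
    total = k_a + k_b
    logp = [nb_logpmf(a, mean_a, var_a) + nb_logpmf(total - a, mean_b, var_b)
            for a in range(total + 1)]
    shift = max(logp)                       # avoid underflow when exponentiating
    weights = [math.exp(lp - shift) for lp in logp]
    observed = logp[k_a]
    extreme = sum(w for lp, w in zip(logp, weights) if lp <= observed + 1e-9)
    return extreme / sum(weights)


# Illustrative moments under the null: mean 50, variance 50 + 0.1 * 50**2 = 300
print(nb_conditional_pvalue(50, 50, 50.0, 300.0, 50.0, 300.0))
print(nb_conditional_pvalue(30, 70, 50.0, 300.0, 50.0, 300.0))
print(nb_conditional_pvalue(5, 95, 50.0, 300.0, 50.0, 300.0))
```

Working with log-probabilities and subtracting the maximum before exponentiating keeps the sums stable for large counts.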
Differential expression
RNA-Seq data: overexpression of two different genes in flies [data: Furlong group]
Type-I error control
comparison of
two replicates
comparison of
treatment vs control
Two noise ranges
dominating noise          How to improve power:
shot noise (Poisson)      deeper sampling
biological noise          more biological replicates
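The two regimes can be made concrete with the "Poisson + constant CV" variance model from earlier, $v = \mu + \alpha \mu^2$, so that the squared coefficient of variation of a single count is $1/\mu + \alpha$. The sketch below is an illustration under that assumption; `alpha = 0.01` and the mean counts are arbitrary, and sequencing deeper is modelled as multiplying the mean count by `depth_factor`.

```python
import math


def relative_se(mean_count: float, alpha: float,
                depth_factor: float = 1.0, n_replicates: int = 1) -> float:
    """Relative standard error of the mean expression estimate.

    Squared CV of one count: 1 / (depth_factor * mean_count) + alpha,
    i.e. a shot-noise term that shrinks with depth plus a biological term
    that does not. Averaging n replicates divides the variance by n.
    """
    cv_squared = 1.0 / (depth_factor * mean_count) + alpha
    return math.sqrt(cv_squared / n_replicates)


alpha = 0.01  # assumed biological squared CV, for illustration

for label, mean_count in (("weakly expressed", 5.0), ("strongly expressed", 5000.0)):
    base = relative_se(mean_count, alpha)
    deeper = relative_se(mean_count, alpha, depth_factor=2.0)
    more_reps = relative_se(mean_count, alpha, n_replicates=2)
    print(f"{label}: base {base:.3f}, 2x depth {deeper:.3f}, 2x replicates {more_reps:.3f}")
```

For the weakly expressed gene, doubling the depth helps noticeably; for the strongly expressed gene it changes almost nothing, and only additional replicates reduce the error.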
Further use cases
Similar count data appears in
comparative ChIP-Seq
barcode sequencing
...
and can be analysed with DESeq as well.
Conclusions I
Proper estimation of variance between biological replicates is vital. Using Poisson variance is incorrect.
Estimating variance-mean dependence with local regression works well for this purpose.
The negative-binomial model allows for a powerful test for differential expression.
S. Anders, W. Huber: "Differential expression analysis for sequence count data", Genome Biol 11 (2010) R106
Software (DESeq) available from Bioconductor and EMBL web site.
Alternative splicing
So far, we counted reads in genes.
To study alternative splicing, reads have to be assigned to transcripts.
This introduces ambiguity, which adds uncertainty.
Current tools (e.g., cufflinks) make it possible to quantify this uncertainty.
However: To assess the significance of differences in isoform ratios between conditions, the assignment uncertainty has to be combined with the noise estimates.
This is not yet possible with existing tools.
Regulation of isoform abundance ratios
In higher eukaryotes, most genes have several isoforms.
RNA-Seq is better suited than microarrays to see which isoforms are present in a sample.
This opens the possibility to study regulation of isoform abundance ratios, e.g.: Is a given exon spliced out more often in one tissue type than in another one?

We will soon release DEXSeq, a tool to test for differential isoform expression in RNA-Seq data.
Data set used to demonstrate DEXSeq:
Drosophila melanogaster S2 cell cultures:
control (no treatment):
4 biological replicates (2x single end, 2x paired end)
treatment: knock-down of pasilla (a splicing factor)
3 biological replicates (1x single end, 2x paired end)
Alternative isoform regulation
Data: Brooks et al., Genome Res., 2010
Exon counting bins
Count table for a gene
number of reads mapped to each exon (or part of exon) in gene msn:
treated_1 treated_2 control_1 control_2
E01 398 556 561 456
E02 112 180 153 137
E03 238 306 298 226
E04 162 171 183 146
E05 192 272 234 199
E06 314 464 419 331
E07 373 525 481 404
E08 323 427 475 373
E09 194 213 273 176
E10 90 90 530 398 <--- !
E11 172 207 283 227
E12 290 397 606 368 <--- ?
E13 33 48 33 33
E14 0 33 2 37
E15 248 314 468 287
E16 554 841 1024 680
[...]
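One way to see why E10 is flagged is to express each bin as a fraction of the gene's reads in the same sample, so that differences in sequencing depth and overall expression drop out. The sketch below does this for the rows shown above; because the table is truncated, the fractions are relative to the listed bins only, and the helper name is illustrative.

```python
samples = ["treated_1", "treated_2", "control_1", "control_2"]

# Rows of the count table for gene msn, as listed above (table is truncated)
counts = {
    "E01": [398, 556, 561, 456],
    "E02": [112, 180, 153, 137],
    "E03": [238, 306, 298, 226],
    "E04": [162, 171, 183, 146],
    "E05": [192, 272, 234, 199],
    "E06": [314, 464, 419, 331],
    "E07": [373, 525, 481, 404],
    "E08": [323, 427, 475, 373],
    "E09": [194, 213, 273, 176],
    "E10": [90, 90, 530, 398],
    "E11": [172, 207, 283, 227],
    "E12": [290, 397, 606, 368],
    "E13": [33, 48, 33, 33],
    "E14": [0, 33, 2, 37],
    "E15": [248, 314, 468, 287],
    "E16": [554, 841, 1024, 680],
}


def exon_fraction(exon: str, sample: str) -> float:
    """Share of the listed reads of the gene that fall into one counting bin."""
    idx = samples.index(sample)
    total = sum(row[idx] for row in counts.values())
    return counts[exon][idx] / total


for exon in ("E09", "E10", "E12"):
    shares = ", ".join(f"{s}: {exon_fraction(exon, s):.1%}" for s in samples)
    print(exon, shares)
```

Whether such a difference in fractions is more than random variation is exactly what the statistical model is needed for.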
Model
The expected count rate for exon $l$ of gene $i$ in sample $j$ can be modelled as the product of
the baseline (control) expression strength of gene $i$
the fraction of the reads from gene $i$ that overlap with exon $l$ under control condition
the effect of the treatment of sample $j$ on the expression strength of gene $i$
the effect of the treatment of sample $j$ on the fraction of the reads from gene $i$ that overlap with exon $l$
the sequencing depth (normalization factor) of sample $j$
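The product structure described above can be sketched as a single function. All parameter names and the example numbers are hypothetical; in particular, the treatment effects are written here as plain multiplicative factors that apply only to treated samples.

```python
def expected_exon_count(size_factor: float,
                        gene_strength: float,
                        exon_fraction: float,
                        gene_effect: float = 1.0,
                        exon_effect: float = 1.0,
                        treated: bool = False) -> float:
    """Expected count for one exon of one gene in one sample.

    size_factor   : sequencing depth (normalization factor) of the sample
    gene_strength : baseline (control) expression strength of the gene
    exon_fraction : fraction of the gene's reads overlapping the exon in control
    gene_effect   : fold change of the gene's expression due to treatment
    exon_effect   : fold change of the exon's read fraction due to treatment
    """
    rate = size_factor * gene_strength * exon_fraction
    if treated:
        rate *= gene_effect * exon_effect
    return rate


# Illustrative numbers: the treatment leaves the gene's overall expression
# unchanged but reduces the usage of this exon to a quarter.
control = expected_exon_count(1.0, 5000.0, 0.1)
treated = expected_exon_count(1.2, 5000.0, 0.1,
                              gene_effect=1.0, exon_effect=0.25, treated=True)
print(control, treated)
```

Testing for differential exon usage then amounts to asking whether the exon-specific effect differs from 1, separately from the gene-level effect.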
Model
Terms of the model equation:
counts in gene $i$, sample $j$, exon $l$
dispersion
size factor
expression strength in control
fraction of reads falling onto exon $l$ in control
change to fraction of reads for exon $l$ due to treatment
change in expression due to treatment
Model, refined
Terms of the model equation:
counts in gene $i$, sample $j$, exon $l$
dispersion
size factor
expression strength in sample $j$
fraction of reads falling onto exon $l$ in control
change to fraction of reads for exon $l$ due to treatment
further refinement: fit an extra factor for library type (paired-end vs single)
Dispersion estimation
Standard maximum-likelihood estimates for dispersion parameters have very strong bias in case of small sample size.
A method-of-moments estimator (as used in DESeq) cannot be used due to crossed factors.
We take over the solution from the new edgeR version: Cox-Reid conditional-maximum-likelihood estimation
[edgeR: Robinson, McCarthy, Smyth (2010)]
Dispersion estimation
Small sample size, so some data sharing is necessary to get power.
one value fits all?
one value for each gene?
one value for each exon?
Dispersion vs mean
RpS14a (FBgn0004403)
DEXSeq
combination of Python scripts and an R package
Python script to get counting bins from a GTF file
Python script to get count table from SAM files
R functions to set up model frames and perform GLM fits and ANODEV
R functions to visualize results and compile an HTML report
nearly ready for release
Conclusion II
Counting within exons and NB-GLMs make it possible to study isoform regulation.
Proper statistical testing shows whether changes in isoform abundances are just random variation or may be attributed to changes in tissue type or experimental condition.
Testing on the level of individual exons gives power and might be helpful to study the mechanisms of alternative isoform regulation.
DEXSeq is nearly ready for release.
Acknowledgements
Coauthors:
Alejandro Reyes
Wolfgang Huber
Funding:
EMBL
Advertisement
HTSeq
A Python package to process and analyse HTS data
HTSeq: Features
A framework to process and analyse high-throughput sequencing data with Python
Simple but powerful interface
Functionality to read, statistically analyse, transform sequences, reads, alignment
Convenient handling of position-specific data such as coverage vectors, or gene and exon positions
Well documented, with examples for common use cases.
In-house support
HTSeq: Typical use cases
Analyse base composition and quality scores for quality assessment of a read
Trim off adapters in snRNA-Seq
Calculate coverage vectors for ChIP-Seq
Assign reads to genes to get count data from RNA-Seq (incl. handling of spliced reads, overlapping genes, ambiguous maps, etc.)
Split reads according to multiplex tags
etc.
Quality assessment with HTSeq
HTSeq: Availability
HTSeq is available from
http://www-huber.embl.de/users/anders/HTSeq
Testers wanted
