Quality Control of High-Throughput Sequencing Data

Systematic pipeline QC
Quality -> exclude failed runs / poorly constructed libraries
Identity -> detect contamination and swaps

LIMITATIONS OF TECHNOLOGY

Typical limitations of technology

Data
Test sample NA12878, Illumina HiSeq WGS
List of gene intervals of interest

GC-rich genes
High-GC intervals (x > 0.6) are badly covered by the new protocol without betaine
Betaine rescues high-GC genes
Med. GC intervals (0.4 < x < 0.6)

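The GC bins named on this slide can be sketched in a few lines. This is a minimal illustration, assuming x is the fraction of G and C bases in an interval's sequence; the function names and the "other" label for values outside the two bins on the slide are assumptions, not part of the original material.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T); N is ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def gc_bin(x: float) -> str:
    """Assign an interval to a GC bin using the cutoffs shown on the slide."""
    if x > 0.6:
        return "high"    # x > 0.6
    if 0.4 < x < 0.6:
        return "medium"  # 0.4 < x < 0.6
    return "other"       # assumption: the slide does not name the remaining bins


if __name__ == "__main__":
    for s in ("GGCCGCAT", "ATGCATGC", "AATTAATT"):
        print(s, round(gc_fraction(s), 3), gc_bin(gc_fraction(s)))
```

Ignoring N bases keeps masked or low-confidence positions from pulling the fraction toward zero.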
Normalization: Norm(X) = X / Mean(X)

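As a small worked example of this normalization, the sketch below divides each coverage value by the mean of all values, so that 1.0 means "average coverage". It assumes X is a list of per-interval coverage values for one sample; the function name is hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def normalize_coverage(coverage: Sequence[float]) -> List[float]:
    """Divide each value by the mean of all values: Norm(X) = X / Mean(X)."""
    m = mean(coverage)
    if m == 0:
        raise ValueError("mean coverage is zero; cannot normalize")
    return [c / m for c in coverage]


if __name__ == "__main__":
    # Illustrative values only, not taken from the slides.
    print(normalize_coverage([10, 20, 30]))
```

After normalization, samples sequenced to different depths can be compared on the same scale.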
Visual inspection of select intervals shows differential effects
Coverage in GC-rich regions increases with betaine concentration
Figure labels: Old protocol; New protocol; +1M betaine

Data
Test sample NA12878, Illumina HiSeq WEx
List of exome target intervals of interest

Tech 2 performs badly in the area where Tech 1 also fails; failure is more obvious!
Figure labels: Tech 2; exon; Tech 1 interval
Tech 2 interval extends to the intron (250 bp upstream, also void of coverage)
Caveat: raw sequence data displayed here are by definition not normalized, so comparisons should be limited to relative amounts of coverage between areas per technology, rather than absolute amounts between technologies.

Location of bait sets plays a dramatic role in exome usability
Tech 1 provided decent coverage, so sequence context is not the problem
Tech 2 produces bad coverage in the area of interest
Tech 2 produces abundant coverage in the intron region
Figure labels: Tech 1; Tech 2; intron; exon; Tech 1 interval; Tech 2 interval
Caveat: raw sequence data displayed here are by definition not normalized, so comparisons should be limited to relative amounts of coverage between areas per technology, rather than absolute amounts between technologies.

QC of workflows and algorithms based on benchmarking

Data that fail any step of quality or identity verification should get blacklisted

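The "fail any step" rule can be sketched as a set of named checks, where a single failure is enough to blacklist a dataset. Everything specific here is an assumption for illustration: the metric names, the cutoffs, and the `CHECKS` table are hypothetical and would need to be replaced with the values a given pipeline actually uses.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]
Check = Callable[[Metrics], bool]

# Hypothetical checks: metric names and cutoffs are illustrative assumptions.
CHECKS: Dict[str, Check] = {
    "alignment": lambda m: m["pct_pf_reads_aligned"] >= 80.0,
    "duplication": lambda m: m["pct_duplication"] <= 30.0,
    "contamination": lambda m: m["contamination_fraction"] <= 0.02,
}


def failed_checks(metrics: Metrics, checks: Dict[str, Check] = CHECKS) -> List[str]:
    """Return the names of every check the dataset does not pass."""
    return [name for name, passes in checks.items() if not passes(metrics)]


def is_blacklisted(metrics: Metrics, checks: Dict[str, Check] = CHECKS) -> bool:
    """A dataset is blacklisted as soon as any single check fails."""
    return len(failed_checks(metrics, checks)) > 0


if __name__ == "__main__":
    sample = {"pct_pf_reads_aligned": 78.435,
              "pct_duplication": 5.0,
              "contamination_fraction": 0.001}
    print(failed_checks(sample), is_blacklisted(sample))
```

Returning the list of failed checks, not just a yes/no answer, makes it easier to record why a dataset was excluded.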
Controlling for contamination and sample swaps

Normal amounts of raw data (in Gb) but poor target coverage
High proportion of chimerism
Strange insert size distribution (too big / too small)
Shearing-based oxidation (poor OxoQ values)
Library size too small

% PF reads (pass filters)    % PF reads aligned
93.031                       78.435
93.378                       74.277

Fewer than 80% of the reads produced (that pass quality filters) were mapped; typical alignment for human exome is > 98%

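The two reference points on this slide (below 80% is a failure, above 98% is typical for human exome) can be turned into a simple classifier. This is only a sketch: treating them as hard cutoffs, the intermediate "below_typical" label, and the function name are assumptions.

```python
def alignment_flag(pct_pf_reads_aligned: float,
                   fail_below: float = 80.0,
                   typical_above: float = 98.0) -> str:
    """Classify the percentage of PF reads that aligned.

    fail_below and typical_above come from the slide's wording; using them
    as hard cutoffs is an assumption made for this sketch.
    """
    if pct_pf_reads_aligned < fail_below:
        return "fail"
    if pct_pf_reads_aligned > typical_above:
        return "ok"
    return "below_typical"


if __name__ == "__main__":
    for pct in (78.435, 74.277):  # the two aligned values shown in the table
        print(pct, alignment_flag(pct))
```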
High percentage of duplicate reads
Appears to reach the coverage target, but values are inflated by duplication
Figure labels: Showing duplicate reads; Hiding duplicate reads

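To illustrate how duplication inflates apparent coverage, the sketch below discounts mean coverage by the duplicate fraction before comparing it with a target. It assumes duplicates are spread evenly across the genome, which is a simplification, and both function names are hypothetical.

```python
def effective_coverage(mean_coverage: float, duplicate_fraction: float) -> float:
    """Approximate coverage after duplicates are removed.

    Assumes duplicates are uniformly distributed, so coverage scales
    with the fraction of non-duplicate reads.
    """
    if not 0.0 <= duplicate_fraction <= 1.0:
        raise ValueError("duplicate_fraction must be between 0 and 1")
    return mean_coverage * (1.0 - duplicate_fraction)


def meets_target(mean_coverage: float, duplicate_fraction: float, target: float) -> bool:
    """Check the coverage target against duplicate-free coverage."""
    return effective_coverage(mean_coverage, duplicate_fraction) >= target


if __name__ == "__main__":
    # Illustrative numbers only: 40x raw coverage with half the reads duplicated.
    print(effective_coverage(40.0, 0.5), meets_target(40.0, 0.5, 30.0))
```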
Uneven coverage in a PCR-Free whole genome
Figure labels: Unevenly covered WGS sample; Evenly covered WGS sample

Uneven coverage between read groups
Figure labels: Read Group 1, 2, 3
Not always a problem: sometimes we add an extra run for a sample to top up coverage (but in this case RG2 in particular looks problematic)

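One way to surface a read group like RG2 automatically is to compare each read group's coverage with the median across read groups. This is a sketch under assumptions: the 50% tolerance, the function name, and the example values are all hypothetical, and a flagged read group still needs a human look, since a deliberate top-up run would also stand out.

```python
from statistics import median
from typing import Dict, List


def outlier_read_groups(coverage_by_rg: Dict[str, float],
                        tolerance: float = 0.5) -> List[str]:
    """Return read groups whose coverage differs from the median by more
    than `tolerance` times the median (tolerance is an assumed cutoff)."""
    med = median(coverage_by_rg.values())
    return sorted(rg for rg, cov in coverage_by_rg.items()
                  if abs(cov - med) > tolerance * med)


if __name__ == "__main__":
    # Made-up coverage values for three read groups.
    print(outlier_read_groups({"RG1": 10.0, "RG2": 2.0, "RG3": 11.0}))
```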
High percentage of chimerism
Reaches coverage goals, but data integrity may be an issue because the number of chimeric reads is so high; this could confound detection of structural rearrangements and indels.

Strange insert size distribution
Figure label: Abnormal spike
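An abnormal spike of this kind can be looked for programmatically in an insert-size histogram. The sketch below flags any insert size whose count is several times the average of its two immediate neighbors. The histogram layout (insert size mapped to read-pair count), the ratio of 3, and the function name are assumptions, and real distributions may need smoothing first.

```python
from typing import Dict, List


def find_spikes(histogram: Dict[int, int], ratio: float = 3.0) -> List[int]:
    """Return insert sizes whose count exceeds `ratio` times the mean
    count of the two adjacent sizes. Sizes missing a neighbor are skipped."""
    spikes = []
    for size in sorted(histogram):
        left = histogram.get(size - 1)
        right = histogram.get(size + 1)
        if left is None or right is None:
            continue
        neighbor_mean = (left + right) / 2
        if neighbor_mean > 0 and histogram[size] > ratio * neighbor_mean:
            spikes.append(size)
    return spikes


if __name__ == "__main__":
    # Made-up histogram with a sharp peak at 300 bp.
    hist = {298: 100, 299: 110, 300: 900, 301: 115, 302: 100}
    print(find_spikes(hist))
```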

Further reading
http://www.broadinstitute.org/gatk/guide/
https://broadinstitute.github.io/picard/index.html
https://broadinstitute.github.io/picard/picard-metric-definitions.html