

Quality Control of
High-Throughput Sequencing data

Methods for assessing quality and detecting problems

TYPES OF QC
QC analysis takes many forms

Tech Dev: understanding limitations of technology
Library preparation protocols
Sequencing technologies
Data processing workflows & algorithms

Systematic pipeline QC
Quality -> exclude failed runs / poorly constructed libraries
Identity -> detect contamination and swaps

LIMITATIONS OF TECHNOLOGY
Typical limitations of technology

Inability to capture data due to genome location and/or sequence composition (GC), reference artifacts
Examples: High GC content, design of exome bait sets
-> Depth of coverage analysis
-> Library size estimates

Sequencing chemistry/tech error modes
Example: High rates of false-positive (FP) indels in IonTorrent PGM
-> Indel error rate estimation

Data processing algorithms
Example: Position-based callers (UG, samtools) do a poor job calling indels
-> Compare different tools/workflows
Example of depth of coverage analysis on WGS

Which of several library construction protocols provides the best coverage distribution for WGS?
Old protocol
New protocol + varying concentrations of betaine

Data
Test sample NA12878, Illumina HiSeq WGS
List of gene intervals of interest

Tools & methods


DepthOfCoverage -> Per-site values aggregated by gene interval
GCContentByInterval -> Per-gene GC content
IGV for sequence data visualization
R + ggplot2 for plotting coverage distribution
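
The per-gene GC content step can be illustrated with a minimal Python sketch. This is a stand-in for illustration only, not how GCContentByInterval is implemented; the function names are hypothetical, and the 0.4 / 0.6 cutoffs are taken from the GC bins used in the faceted coverage plot.

```python
def gc_fraction(sequence):
    """Return the fraction of G/C bases among unambiguous (A/C/G/T) bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences with no unambiguous bases (e.g. all N) get 0.0 by convention here.
    return gc / total if total else 0.0


def gc_bin(fraction, low=0.4, high=0.6):
    """Assign a GC fraction to the low / medium / high bin used for faceting."""
    if fraction < low:
        return "low"
    if fraction > high:
        return "high"
    return "medium"
```

For example, an interval with GC content 0.69 falls in the "high" bin.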
Distribution of coverage over gene intervals
Faceted by prep method and GC content

Increasing concentrations of betaine improve coverage

[Figure: coverage distributions, faceted by prep method and GC content]
GC facets: High GC intervals (x > 0.6); Med. GC intervals (0.4 < x < 0.6); Low GC intervals (x < 0.4)
GC-rich genes are badly covered by the new protocol without betaine
Betaine rescues high-GC genes
Wide distribution => uneven coverage
Narrow distribution => even coverage

Normalization: Norm(X) = X / Mean(X)
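
A minimal Python sketch of this normalization, assuming X is the coverage of one interval and the mean is taken over all intervals of the same sample (the slide does not say which mean is used). The function name and the interval names in the usage note are hypothetical.

```python
from statistics import mean


def normalize_coverage(coverage_by_interval):
    """Divide each interval's coverage by the mean coverage over all intervals."""
    if not coverage_by_interval:
        raise ValueError("no intervals given")
    sample_mean = mean(coverage_by_interval.values())
    if sample_mean == 0:
        raise ValueError("mean coverage is zero; cannot normalize")
    return {name: depth / sample_mean
            for name, depth in coverage_by_interval.items()}
```

With this scaling, `normalize_coverage({"geneA": 10, "geneB": 30})` gives 0.5 and 1.5, so samples sequenced to different depths can be compared on the same axis.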
Visual inspection of select intervals shows differential effects

[IGV tracks: Old protocol; New protocol (no betaine); New protocol + 1M betaine; New protocol + 2M betaine]
Region shown: GC content = 0.69
Coverage in GC-rich regions increases with betaine concentration
Coverage is similar between GC-rich and average regions
Example of depth of coverage analysis on WEx

Which exome technology provides the best coverage over the intervals that interest us?
Tech 1
Tech 2 differs mainly by location of baits relative to exons

Data
Test sample NA12878, Illumina HiSeq WEx
List of exome target intervals of interest

Tools & methods


DiagnoseTargets -> Per-interval summary of usability metrics
IGV for sequence data visualization
VCF output by DiagnoseTargets
Degrees of failure
Example: failure due to problematic sequence context

[IGV screenshot. Tracks: Tech 1, Tech 2. Annotated features: exon, Tech 1 interval, Tech 2 interval]
Tech 1 coverage was very low in this area, but deeper coverage in the rest inflates the overall Tech 1 coverage score for the exon, allowing it to pass filters
Tech 2 performs badly in the area where Tech 1 also fails; failure is more obvious!
Tech 2 interval extends to the intron (250 bp upstream, also void of coverage)

Caveat: raw sequence data displayed here are by definition not normalized, so comparisons should be limited to relative amounts of coverage between areas per technology, rather than absolute amounts between technologies.
Location of bait sets plays a dramatic role in exome usability

[IGV screenshot. Tracks: Tech 1, Tech 2. Annotated features: intron, exon, Tech 1 interval, Tech 2 interval]
Tech 1 provided decent coverage, so sequence context is not the problem
Tech 2 produces bad coverage in the area of interest
Tech 2 produces abundant coverage in the intron region
Caveat: raw sequence data displayed here are by definition not normalized, so comparisons should be limited to relative amounts of coverage between areas per technology, rather than absolute amounts between technologies.
QC of workflows and algorithms based on benchmarking

Choose a public, common sample
Human: NA12878 + parents
Sequence with multiple technologies (machines and protocols)
- With and without PCR, different machines, different read lengths, etc.
Compare to a knowledgebase (e.g. NIST GIAB)
Make sure that results are better or at least no worse
(specific comparison metrics will be discussed towards the end of the workshop)

SYSTEMATIC PIPELINE QC
Example: QC in the Broad's production pipeline

(1) QC: Fidelity of barcode matching, cluster density, number of reads, bases, etc.

(2) QC: Quality of alignment, library construction, coverage, base quality, internal controls, SAM format validation + Identity through fingerprints

(3) QC: Cumulative quality from (2) by sample, cross-sample contamination + Identity: fingerprints, read groups cross-check

(4) QC: VCF format validation, genotype concordance on control samples, variant calling quality metrics

Data that fail any step of quality or identity verification should get blacklisted
Controlling for contamination and sample swaps

Contamination (and barcode-swapping) can be checked using VerifyBamID
Check for contamination in tumor samples using ContEst
Fingerprinting (currently private code)
Run samples on fingerprinting chip (~100 sites)
Gives odds ratio between a swap and a no-swap situation
One could use GenotypeConcordance as a substitute for the private code, but it would only work well on the aggregated BAM
Cross-check results from separate lanes once aggregated per-sample (again using private code)
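
To make the odds-ratio idea concrete, here is a minimal Python sketch of a log-odds (LOD) score comparing "same individual" against "different individuals" across fingerprint sites. It is not the private fingerprinting code, which is not described here. Everything in it is an assumption made for illustration: hard genotype calls coded as 0/1/2 alternate alleles, Hardy-Weinberg genotype priors from a known allele frequency, a flat miscall rate, and all function names.

```python
import math


def genotype_priors(alt_freq):
    """Hardy-Weinberg priors for 0, 1 and 2 copies of the alternate allele."""
    ref_freq = 1.0 - alt_freq
    return [ref_freq * ref_freq, 2.0 * ref_freq * alt_freq, alt_freq * alt_freq]


def call_likelihood(called, true, error_rate):
    """P(called genotype | true genotype) under a flat miscall model (assumed)."""
    return 1.0 - error_rate if called == true else error_rate / 2.0


def site_lod(call_a, call_b, alt_freq, error_rate=0.01):
    """log10 odds that two calls at one site come from the same individual."""
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    priors = genotype_priors(alt_freq)
    # Same individual: both calls derive from one shared true genotype.
    p_same = sum(prior
                 * call_likelihood(call_a, g, error_rate)
                 * call_likelihood(call_b, g, error_rate)
                 for g, prior in enumerate(priors))
    # Different individuals: each call derives from an independent genotype.
    p_a = sum(prior * call_likelihood(call_a, g, error_rate)
              for g, prior in enumerate(priors))
    p_b = sum(prior * call_likelihood(call_b, g, error_rate)
              for g, prior in enumerate(priors))
    return math.log10(p_same / (p_a * p_b))


def fingerprint_lod(calls_a, calls_b, alt_freqs, error_rate=0.01):
    """Sum per-site LODs, treating sites as independent (assumed)."""
    return sum(site_lod(a, b, f, error_rate)
               for a, b, f in zip(calls_a, calls_b, alt_freqs))
```

Under these assumptions a positive total favors the no-swap situation and a negative total favors a swap; with ~100 sites the per-site evidence adds up quickly.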
Tools for systematic QC of sequence and mapping quality

Picard Metrics collection tools
See Collect*Metrics tools in
https://broadinstitute.github.io/picard/index.html
Metrics are defined in
https://broadinstitute.github.io/picard/picard-metric-definitions.html

User-friendly alternatives with GUI (we don't use these)
FastQC
Specializes in basic sequence quality assessment
QualiMap
Specializes in mapping quality assessment
Typical quality failures detected by Picard QC tools

Normal amounts of raw data (in Gb) but poor target coverage
High proportion of chimerism
Strange insert size distribution (too big / too small)
Shearing-based oxidation (poor OxoQ values)
Library size too small

Exomes
Severe unevenness in distribution of coverage (Fold80 penalty values)
HS / reference bias based oxidation (poor cref-OxoQ values)

Whole genomes
High proportion of unmapped reads
High percentage of adapter/oligo
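
As a rough illustration of what a Fold80 penalty measures, the sketch below uses a common approximation: mean coverage divided by the depth that 80% of bases reach or exceed. Picard's own calculation may differ in detail (see the metric definitions page); the function name and the per-base input list are assumptions.

```python
from statistics import mean


def fold_80_penalty(per_base_depths):
    """Approximate Fold80: mean depth / depth reached by at least 80% of bases."""
    depths = sorted(per_base_depths)
    if not depths:
        raise ValueError("no depths given")
    # At least 80% of the sorted values are >= the value at index n // 5.
    depth_at_80_pct = depths[len(depths) // 5]
    if depth_at_80_pct == 0:
        return float("inf")
    return mean(depths) / depth_at_80_pct
```

Perfectly even coverage gives 1.0; the more uneven the coverage, the larger the value, because more extra sequencing would be needed to bring the poorly covered bases up to the mean.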
Enough data produced but low mapping rate

% PF reads (pass filters)   % PF reads aligned
93.031                      78.435
93.378                      74.277

Fewer than 80% of reads produced (that pass quality filters) were mapped; typical alignment for human exome is > 98%
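
A check like this is easy to automate. The Python sketch below flags samples whose aligned percentage falls under a cutoff; the 98% default is borrowed from the "typical" human exome figure above and is an assumption, not a documented pipeline threshold. The function name and sample names are hypothetical, and values are percentages as in the table.

```python
def low_alignment_samples(pct_aligned_by_sample, min_pct=98.0):
    """Return sorted names of samples whose % PF reads aligned is below min_pct."""
    return sorted(name for name, pct in pct_aligned_by_sample.items()
                  if pct < min_pct)


# The two table values plus one hypothetical well-aligned sample.
metrics = {"sample_1": 78.435, "sample_2": 74.277, "sample_3": 99.1}
flagged = low_alignment_samples(metrics)
```

Here `flagged` contains `sample_1` and `sample_2`, matching the two failing rows in the table.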
High percentage of duplicate reads

Mean coverage   % Excl dupe   % Excl overlap   % Excl Total
32              5.2           7.9              17.3
32              21.1          2.3              25.1
23              20.4          1.8              25.8

Appear to reach coverage target but values are inflated by duplication
[IGV screenshots: Showing duplicate reads; Hiding duplicate reads]
Uneven coverage in a PCR-Free whole genome

Mean coverage   % bases at 15X
31.6            69

Reaches overall coverage target but data is unevenly distributed: piles of reads in some places alternate with uncovered regions
[IGV tracks: Unevenly covered WGS sample; Evenly covered WGS sample]
Uneven coverage between read groups
[IGV tracks: Read Group 1; Read Group 2; Read Group 3]
Not always a problem: sometimes we add an extra run for a sample to top up coverage (but in this case RG2 in particular looks problematic)
High percentage of chimerism

% Chimeras   % Selected bases   % Target bases 20X
30.272       65.438             78.478
13.405       70.036             84.811

Reaches coverage goals, but data integrity may be an issue because the number of chimeric reads is so high; this could confound detection of structural rearrangements and indels.
Strange insert size distribution
[Insert size histogram, annotated: Abnormal spike]

Bacterial contamination in cheek swab samples produces deep piles of partly aligned reads

Further reading
http://www.broadinstitute.org/gatk/guide/
https://broadinstitute.github.io/picard/index.html
https://broadinstitute.github.io/picard/picard-metric-definitions.html
