GATKwr8 A 4 Sequence QC

Quality Control of
High-Throughput Sequencing data
Methods for assessing quality and detecting problems

TYPES OF QC
QC analysis takes many forms
Tech Dev: understanding limitations of technology
  Library preparation protocols
  Sequencing technologies
  Data processing workflows & algorithms
Systematic pipeline QC
  Quality -> exclude failed runs / poorly constructed libraries
  Identity -> detect contamination and swaps

LIMITATIONS OF TECHNOLOGY
Typical limitations of technology
Inability to capture data due to genome location and/or sequence composition (GC), reference artifacts
  Examples: high GC content, design of exome bait sets
  -> Depth of coverage analysis
  -> Library size estimates
Sequencing chemistry/tech error modes
  Example: high rates of FP indels in IonTorrent PGM
  -> Indel error rate estimation
Data processing algorithms
  Example: position-based callers (UG, samtools) do a poor job calling indels
  -> Compare different tools/workflows
Example of depth of coverage analysis on WGS
Which of several library construction protocols provides the best coverage distribution for WGS?
  Old protocol
  New protocol + varying concentrations of betaine
Data
Test sample NA12878, Illumina HiSeq WGS
List of gene intervals of interest
Tools & methods

DepthOfCoverage -> Per-site values aggregated by gene interval
GCContentByInterval -> Per-gene GC content
IGV for sequence data visualization
R + ggplot2 for plotting coverage distribution
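As a rough illustration of the two per-interval quantities listed above, the sketch below averages per-site depth over each gene interval and computes the GC fraction of an interval's reference sequence. It is not GATK code: the function names, the dictionary-based inputs, and the 1-based inclusive coordinates are assumptions made for this example.

```python
from statistics import mean


def mean_depth_by_interval(site_depth, intervals):
    """Average per-site depth over each named interval.

    site_depth: dict mapping (contig, position) -> depth (hypothetical layout).
    intervals:  dict mapping name -> (contig, start, end), assumed 1-based inclusive.
    Sites missing from site_depth are counted as depth 0.
    """
    result = {}
    for name, (contig, start, end) in intervals.items():
        depths = [site_depth.get((contig, pos), 0) for pos in range(start, end + 1)]
        result[name] = mean(depths)
    return result


def gc_content(sequence):
    """Fraction of G/C among called bases (A, C, G, T); 0.0 if none are called."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)
```

In practice the per-site depths and interval sequences would come from the tools' output files; the in-memory dictionaries here only keep the example self-contained.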
Distribution of coverage over gene intervals
Faceted by prep method and GC content
Increasing concentrations of betaine improve coverage
[Figure: coverage distributions per prep method, faceted by GC content: high-GC intervals (x > 0.6), medium-GC intervals (0.4 < x < 0.6), low-GC intervals (x < 0.4)]
GC-rich genes are badly covered by the new protocol without betaine
Betaine rescues high-GC genes
Wide distribution => uneven coverage
Narrow distribution => even coverage
Normalization: Norm(X) = X / Mean(x)
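A minimal sketch of this normalization and of the GC faceting, assuming that X is the coverage of one interval and Mean(x) is the mean coverage over all intervals of the same sample (the slide does not spell this out). The function names are hypothetical, and assigning values of exactly 0.4 or 0.6 to the medium bin is an assumption, since the slide only gives strict inequalities.

```python
from statistics import mean


def normalize_by_mean(coverage_by_interval):
    """Divide each interval's coverage by the mean coverage over all intervals."""
    overall = mean(coverage_by_interval.values())
    if overall == 0:
        raise ValueError("mean coverage is zero; cannot normalize")
    return {name: cov / overall for name, cov in coverage_by_interval.items()}


def gc_bin(gc_fraction):
    """Assign a GC fraction to the high / medium / low facets used on the slide."""
    if gc_fraction > 0.6:
        return "high"
    if gc_fraction < 0.4:
        return "low"
    return "medium"  # includes exactly 0.4 and 0.6 (assumption)
```

After normalization, a value near 1 means the interval is covered about as well as the sample average, which makes distributions from runs of different depth comparable.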
Visual inspection of select intervals shows differential effects
[Figure: coverage tracks for old protocol, new protocol (no betaine), new protocol + 1M betaine, new protocol + 2M betaine; GC content = 0.69]
Coverage in GC-rich regions increases with betaine concentration
Coverage is similar between GC-rich and average regions
Example of depth of coverage analysis on WEx
Which exome technology provides the best coverage over the intervals that interest us?
  Tech 1
  Tech 2 differs mainly by location of baits relative to exons
Data
Test sample NA12878, Illumina HiSeq WEx
List of exome target intervals of interest
Tools & methods

DiagnoseTargets -> Per-interval summary of usability metrics
IGV for sequence data visualization
VCF output by DiagnoseTargets
Degrees of failure
Example of a failure due to problematic sequence context
[Figure: coverage tracks for Tech 1 and Tech 2 over the exon, with the Tech 1 interval and the Tech 2 interval marked]
Tech 1: coverage was very low in this area, but deeper coverage in the rest inflates the overall coverage score for the exon, allowing it to pass filters
Tech 2: performs badly in the area where Tech 1 also fails; failure is more obvious!
Tech 2 interval extends to the intron (250 bp upstream, also void of coverage)
Caveat: raw sequence data displayed here are by definition not normalized, so comparisons should be limited to relative amounts of coverage between areas per technology, rather than absolute amounts between technologies.
Location of bait sets plays a dramatic role in exome usability
[Figure: coverage tracks for Tech 1 and Tech 2 over the intron and exon, with the Tech 1 interval and the Tech 2 interval marked]
Tech 1 provided decent coverage, so sequence context is not the problem
Tech 2 produces bad coverage in the area of interest
Tech 2 produces abundant coverage in the intron region
Caveat: raw sequence data displayed here are by definition not normalized, so comparisons should be limited to relative amounts of coverage between areas per technology, rather than absolute amounts between technologies.
QC of workflows and algorithms based on benchmarking
Choose a public, common sample
  Human: NA12878 + parents
Sequence with multiple technologies (machines and protocols)
  - With and without PCR, different machines, different read lengths, etc.
Compare to a knowledgebase (e.g. NIST GIAB)
Make sure that results are better or at least no worse (specific comparison metrics will be discussed towards the end of the workshop)
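The "better or at least no worse" check can be sketched as follows. The slides defer the actual comparison metrics to later in the workshop, so sensitivity and precision are used here purely as stand-in metrics (an assumption), and variants are represented by arbitrary hashable keys. All function names are hypothetical.

```python
def compare_to_truth(calls, truth):
    """Compare a call set with a truth set; both are collections of variant keys."""
    calls, truth = set(calls), set(truth)
    true_pos = len(calls & truth)
    false_pos = len(calls - truth)
    false_neg = len(truth - calls)
    return {
        "sensitivity": true_pos / (true_pos + false_neg) if truth else 0.0,
        "precision": true_pos / (true_pos + false_pos) if calls else 0.0,
    }


def no_worse(new_metrics, old_metrics, tolerance=0.0):
    """True if the new workflow matches or beats the old one on every metric."""
    return all(
        new_metrics[key] >= old_metrics[key] - tolerance
        for key in ("sensitivity", "precision")
    )
```

A real comparison would key variants by contig, position, reference and alternate alleles, and restrict both sets to the knowledgebase's confident regions.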

SYSTEMATIC PIPELINE QC
Example: QC in the Broad's production pipeline
(1) Fidelity of barcode matching, cluster density, number of reads, bases, etc.
(2) Quality of alignment, library construction, coverage, base quality, internal controls, SAM format validation + identity through fingerprints
(3) Cumulative quality from (2) by sample, cross-sample contamination + identity fingerprints, read groups cross-check
(4) VCF format validation, genotype concordance on control samples, variant calling quality metrics
Data that fail any step of quality or identity verification should get blacklisted
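The "fail any step, get blacklisted" rule amounts to a simple gate over named checks. The sketch below is not the production pipeline: the metric names and every threshold in `EXAMPLE_CHECKS` are hypothetical placeholders chosen only to make the example run.

```python
def evaluate_sample(metrics, checks):
    """Run every check; a sample is blacklisted if at least one check fails."""
    failed = [name for name, passes in checks.items() if not passes(metrics)]
    return {"blacklisted": bool(failed), "failed_checks": failed}


# Hypothetical checks and thresholds, for illustration only.
EXAMPLE_CHECKS = {
    "pct_pf_reads_aligned": lambda m: m["pct_pf_reads_aligned"] >= 98.0,
    "fingerprint_match": lambda m: m["fingerprint_log_odds"] > 0,
    "contamination": lambda m: m["contamination_fraction"] <= 0.02,
}
```

Keeping the list of failed checks alongside the verdict makes it easier to tell a failed library from an identity problem when reviewing blacklisted data.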
Controlling for contamination and sample swaps
Contamination (and barcode-swapping) can be checked using VerifyBamID
Check for contamination in tumor samples using ContEst
Fingerprinting (currently private code)
  Run samples on fingerprinting chip (~100 sites)
  Gives odds ratio between a swap and a no-swap situation
  One could use GenotypeConcordance as a substitute for the private code, but it would only work well on the aggregated BAM
Cross-check results from separate lanes once aggregated per-sample (again using private code)
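Since the fingerprinting code is private, the following is only a generic sketch of how an odds ratio between "no swap" and "swap" could be computed, not the method used in the pipeline. It assumes hard genotype calls coded as the number of alternate alleles (0, 1 or 2), a single symmetric genotyping error rate, Hardy-Weinberg genotype frequencies for an unrelated sample, and independent sites. All names are hypothetical.

```python
import math


def hardy_weinberg(alt_freq):
    """Genotype frequencies (0, 1, 2 alt alleles) under Hardy-Weinberg equilibrium."""
    p = alt_freq
    return {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}


def p_observed(observed, true, error_rate):
    """Probability of observing a genotype given the true one (symmetric error model)."""
    return 1 - error_rate if observed == true else error_rate / 2


def no_swap_log_odds(sites, error_rate=0.01):
    """Sum of log10 odds, over sites, that the data come from the expected sample.

    sites: iterable of (expected_genotype, observed_genotype, alt_allele_frequency).
    Positive values favour "no swap"; negative values favour "swap".
    """
    if not 0 < error_rate < 1:
        raise ValueError("error_rate must be strictly between 0 and 1")
    total = 0.0
    for expected, observed, alt_freq in sites:
        p_same = p_observed(observed, expected, error_rate)
        p_other = sum(
            freq * p_observed(observed, genotype, error_rate)
            for genotype, freq in hardy_weinberg(alt_freq).items()
        )
        total += math.log10(p_same / p_other)
    return total
```

With around 100 sites, matching genotypes accumulate a clearly positive score and mismatches a clearly negative one, which is why a small chip is enough to separate the two situations.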
Tools for systematic QC of sequence and mapping quality
Picard metrics collection tools
  See Collect*Metrics tools in https://broadinstitute.github.io/picard/index.html
  Metrics are defined in https://broadinstitute.github.io/picard/picard-metric-definitions.html
User-friendly alternatives with GUI (we don't)
  FastQC
    Specializes in basic sequence quality assessment
  QualiMap
    Specializes in mapping quality assessment
Typical quality failures detected by Picard QC tools
Normal amounts of raw data (in Gb) but poor target coverage
High proportion of chimerism
Strange insert size distribution (too big / too small)
Shearing-based oxidation (poor OxoQ values)
Library size too small
Exomes
  Severe unevenness in distribution of coverage (Fold80 penalty values)
  HS / reference bias based oxidation (poor cref-OxoQ values)
Whole genomes
  High proportion of unmapped reads
  High percentage of adapter/oligo
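To make the Fold80 penalty listed above more concrete: it expresses how much extra sequencing would be needed to bring 80% of target bases up to the current mean coverage. A common approximation, used in this sketch, is mean coverage divided by the 20th-percentile coverage. The function name, the plain list of per-base depths, and the nearest-rank percentile are assumptions; Picard's own computation may differ in detail.

```python
import math
from statistics import mean


def fold_80_penalty(depths):
    """Approximate Fold80 penalty: mean depth divided by the 20th-percentile depth."""
    if not depths:
        raise ValueError("no depths given")
    ordered = sorted(depths)
    rank = (len(ordered) + 4) // 5  # nearest-rank 20th percentile: ceil(n / 5)
    percentile_20 = ordered[rank - 1]
    if percentile_20 == 0:
        return math.inf  # at least 20% of bases are uncovered
    return mean(ordered) / percentile_20
```

Perfectly even coverage gives a value of 1; the more uneven the coverage, the larger the penalty.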
Enough data produced but low mapping rate
% PF reads (pass filters)   % PF reads aligned
93.031                      78.435
93.378                      74.277
Fewer than 80% of reads produced (that pass quality filters) were mapped; typical alignment for human exome is > 98%
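A small sketch of how such a mapping-rate check might be automated. The 98% default comes from the "typical alignment for human exome" figure quoted on this slide, but using it as a hard cutoff is an assumption, as are the function names.

```python
def pct_aligned(pf_reads, pf_reads_aligned):
    """Percentage of pass-filter reads that were aligned."""
    if pf_reads == 0:
        raise ValueError("no pass-filter reads")
    return 100.0 * pf_reads_aligned / pf_reads


def low_mapping_rate(pct_pf_reads_aligned, minimum_pct=98.0):
    """Flag a sample whose aligned percentage falls below the chosen minimum."""
    return pct_pf_reads_aligned < minimum_pct
```

Both example samples in the table (78.435% and 74.277% aligned) would be flagged by this check.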
High percentage of duplicate reads
Mean coverage   % Excl dupe   % Excl overlap   % Excl Total
32              5.2           7.9              17.3
32              21.1          2.3              25.1
23              20.4          1.8              25.8
Appear to reach coverage target but values are inflated by duplication
[Figure: same region shown with duplicate reads displayed and with duplicate reads hidden]
Uneven coverage in a PCR-Free whole genome
Mean coverage   % bases at 15X
31.6            69
Reaches overall coverage target but data is unevenly distributed: piles of reads in some places alternate with uncovered regions
[Figure: unevenly covered WGS sample compared with evenly covered WGS sample]
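The contrast between a healthy mean and a poor "% bases at 15X" can be reproduced with a few lines. This is a sketch over a plain list of per-base depths; the function name and return layout are hypothetical, and the 15X default simply mirrors the column on the slide.

```python
from statistics import mean


def coverage_summary(depths, min_depth=15):
    """Mean depth and percentage of bases covered at or above min_depth."""
    if not depths:
        raise ValueError("no depths given")
    at_or_above = sum(1 for depth in depths if depth >= min_depth)
    return {
        "mean_coverage": mean(depths),
        "pct_bases_at_min_depth": 100.0 * at_or_above / len(depths),
    }
```

For example, depths of 60, 60, 0 and 0 average to 30X while only half of the bases reach 15X, which is the same pattern as the sample shown here, in miniature.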
Uneven coverage between read groups
[Figure: coverage tracks for read groups 1, 2 and 3]
Not always a problem: sometimes we add an extra run for a sample to top up coverage (but in this case RG2 in particular looks problematic)
High percentage of chimerism
% Chimeras   % Selected bases   % Target bases 20X
30.272       65.438             78.478
13.405       70.036             84.811
Reaches coverage goals, but data integrity may be an issue as the number of chimeric reads is so high; this could confound detection of structural rearrangements and indels.
Strange insert size distribution
Abnormal spike
Bacterial contamination in cheek swab samples produces deep piles of partly aligned reads
Further reading
http://www.broadinstitute.org/gatk/guide/
https://broadinstitute.github.io/picard/index.html
https://broadinstitute.github.io/picard/picard-metric-definitions.html
