GATKwr8 S 3 Variant Calling With MuTect

Calling variants with MuTect
SNPs today, Indels tomorrow

Somatic Variant Discovery Workflow
Indels coming soon! (M2)
+ some post-processing to rescue TiN variants and eliminate artifacts
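Challenge: deal with difficult-to-predict allelic fraction
Example sample: 33% normal (N) cells and 67% tumor (T) cells, i.e. two tumor cells for every normal cell
(1) Purity = 67%
(2) Local copy number in tumor = 4
(3) Number of mutated copies per cancer cell = 1

-> Allelic fraction = 2/10 = 0.2
   (2 mutated copies out of 2 x 4 tumor copies + 1 x 2 normal copies, assuming a diploid normal)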
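
The arithmetic on this slide can be written as a small function. This is a minimal sketch, assuming the normal cells are diploid at the locus; the function and parameter names are illustrative and do not come from MuTect.

```python
from fractions import Fraction

def expected_allele_fraction(purity, tumor_copy_number, mutated_copies,
                             normal_copy_number=2):
    """Expected fraction of reads carrying a somatic variant in a T/N mixture.

    purity is the fraction of cells that are tumor cells. The default
    normal_copy_number=2 is an assumption (diploid normal).
    """
    mutant = purity * mutated_copies
    total = purity * tumor_copy_number + (1 - purity) * normal_copy_number
    return mutant / total

# The slide's example: purity 2/3 (about 67%), tumor copy number 4,
# one mutated copy per cancer cell.
af = expected_allele_fraction(Fraction(2, 3), 4, 1)
print(float(af))  # 0.2
```

Using exact fractions keeps the 2/10 result free of rounding; passing floats such as 0.67 works as well and gives a value close to 0.2.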
Remember T/N pair comparisons
TUMOR
NORMAL
PART 1:
THE ORIGINAL MUTECT
MuTect: A brief history
MuTect was born as Tector back in 2009

Published in Nature Biotech, 2013

Core statistical model is a Bayesian classifier, for both detection of a point variant in the tumor and classification of the event as somatic

Pragmatic hard filters used to control false positive rate
MuTect method overview
[Figure: MuTect method overview, from Cibulskis et al., Nat. Biotechnol. (2013). Tumor and normal reads pass through read filters. The variant detection statistic, $\log_{10}\frac{L[M_f^m]\,P(m,f)}{L[M_0]\,(1-P(m,f))} \geq \theta_T$, gives the STD callset. Site-based variant filters (proximal gap, strand bias, poor mapping, triallelic site, clustered position, observed in control) give the HC callset, and the panel of normal samples gives the HC + PON callset. Variant classification with the matched normal and dbSNP yields the candidate somatic mutations.]
Core variant detection algorithm
MuTect was based on the GATK's UnifiedGenotyper (locus-based caller, but without the ploidy expectation)

For each position in the input intervals:
  For each possible alternate allele:
    Estimate the allele fraction from the tumor pileup
    Calculate hom-ref and het likelihoods
    Report tumor LOD as log-odds ratio of het to hom-ref

f is the allele fraction estimate
r is the reference allele
m is the variant (mutant) allele
e_i is the error rate at base b_i
e_i/3 is the rate at which an error in the ref matches the alt m
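
The loop and symbol definitions on this slide can be sketched in a few lines. This is an illustration only, not MuTect's code: the per-base probabilities are written from the definitions of f, r, m and e_i given here, the pileup is a hypothetical list of (base, error rate) pairs, and priors and filters are left out.

```python
import math

def base_likelihood(b, e, r, m, f):
    """P(observed base b | error rate e, ref r, alt m, allele fraction f)."""
    if b == r:
        # True ref read without error, or alt read miscalled as ref.
        return f * (e / 3) + (1 - f) * (1 - e)
    if b == m:
        # True alt read without error, or ref read miscalled as alt.
        return f * (1 - e) + (1 - f) * (e / 3)
    return e / 3  # any other base can only be an error

def tumor_lod(pileup, r, m):
    """log10 odds of the variant model (f estimated) over the reference model (f = 0).

    pileup: list of (base, error_rate) pairs with error_rate > 0.
    """
    if not pileup:
        raise ValueError("empty pileup")
    alt_count = sum(1 for b, _ in pileup if b == m)
    f = alt_count / len(pileup)  # allele fraction estimate from the pileup
    log_variant = sum(math.log10(base_likelihood(b, e, r, m, f)) for b, e in pileup)
    log_reference = sum(math.log10(base_likelihood(b, e, r, m, 0.0)) for b, e in pileup)
    return log_variant - log_reference

# Hypothetical pileup: 25 reference reads and 5 alternate reads, all at Q30.
pileup = [("A", 0.001)] * 25 + [("T", 0.001)] * 5
lod = tumor_lod(pileup, r="A", m="T")
```

With no alternate reads the estimate of f is 0, the two models coincide and the LOD is exactly 0; a handful of high-quality alternate reads pushes it well above 0.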

Internal filters
Filters internal to MuTect, no VQSR yet!

[Figure: the method overview from Cibulskis et al., Nat. Biotechnol. (2013), highlighting the six site-based variant filters (proximal gap, strand bias, poor mapping, triallelic site, clustered position, observed in control) and the panel of normals (PON) filter. HC, high confidence.]
Resources
Panel of Normals (PON) = blacklist

dbSNP = blacklist (less trusted than PON)
COSMIC = whitelist
Application of the Panel of Normals (PoN)
Would you consider this a good somatic variant candidate, based on the tumor and matched normal pairing?

  Tumor:          alt: 7, tot: 100
  Matched Normal: tot: 81

What if 968 samples in a panel of ~8500 normals exhibit similar low-AF alternate alleles and low to moderate levels of noise at and around the locus? Would you still consider this a somatic variant, or simply an error-prone site?

  Tumor:          alt: 7, tot: 100
  Matched Normal: tot: 81
  PoN member 1:   alt: 6, tot: 128
  PoN member 2:   alt: 7, tot: 120
  PoN member 3:   alt: 3, tot: 61
  PoN member 4:   alt: 3, tot: 140
  PoN member 5:   alt: 8, tot: 234
  PoN member 6:   alt: 3, tot: 67
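
To make the comparison concrete, the read counts quoted on this slide can be turned into allele fractions. This is only a sketch of the reasoning, not how MuTect applies a PoN; the variable names are illustrative.

```python
# Read counts copied from the slide: (alt reads, total reads).
tumor = (7, 100)
pon_members = [(6, 128), (7, 120), (3, 61), (3, 140), (8, 234), (3, 67)]

def allele_fraction(alt, total):
    return alt / total

tumor_af = allele_fraction(*tumor)
pon_afs = [allele_fraction(alt, tot) for alt, tot in pon_members]

# Share of the panel showing the same low-AF allele: 968 of roughly 8500 normals.
panel_hit_rate = 968 / 8500

print(f"tumor AF: {tumor_af:.3f}")
print("PoN AFs:", ", ".join(f"{af:.3f}" for af in pon_afs))
print(f"panel hit rate: {panel_hit_rate:.1%}")
```

The six normals shown carry the alternate allele at fractions of a few percent, the same order as the tumor's 7%, and more than a tenth of the panel shows it, which points to an error-prone site rather than a somatic event.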
What about this one? Does this look like a good event?

  Tumor:          alt: 4, tot: 49
  Matched Normal: tot: 65

11/~8500 PoN members display a similar low-AF allele near what appears to be an upstream germline variant.

  Tumor:          alt: 4, tot: 49
  Matched Normal: tot: 65
  PoN member 1:   alt: 10, tot: 39
  PoN member 2:   alt: 8, tot: 36
How to run MuTect
MuTect shares some code with the GATK, but is currently built into its own jar
Version numbers may vary

java -jar mutect-1.1.7.jar -T MuTect \
    -R human.fasta \
    -I:normal normal.bam \
    -I:tumor tumor.bam \
    --dbsnp dbsnp137.vcf \
    --cosmic cosmic.vcf \
    [ -L exome_targets.intervals ] \
    -o sample.call_stats.txt \
    --coverage_file sample.coverage.wig.txt


How to run MuTect without a matched normal
MuTect can be run without a matched normal, but germline SNPs will be called
Panel of normals will help

java -jar mutect-1.1.7.jar -T MuTect \
    -R human.fasta \
    -I:tumor tumor.bam \
    --normal_panel PoN.vcf \
    -o sample.call_stats.txt \
    --coverage_file sample.coverage.wig.txt

Making a PON
First we run MuTect on a set of normals to detect common errors that appear as variants
Then we combine the normal callsets and retain variants called in at least two samples

For each normal sample:
java -jar mutect-1.1.7.jar -T MuTect \
    -R human.fasta \
    -I:tumor normal.bam \
    --artifact_detection_mode \
    -vcf normal1.call_stats.vcf \
    --coverage_file normal1.coverage.wig.txt

Combine the per-normal callsets:
java -jar GenomeAnalysisTK.jar -T CombineVariants \
    -R human.fasta \
    -V normal1.call_stats.vcf \
    -V normal2.call_stats.vcf \
    -V ... \
    -minN 2 \
    --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED \
    --filteredAreUncalled \
    -o PoN.vcf
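
The "at least two samples" rule can be illustrated without any GATK machinery. The sketch below is not what CombineVariants does internally and ignores filter status; it only shows the counting logic, with hypothetical site tuples standing in for VCF records.

```python
from collections import Counter

def build_pon(callsets, min_n=2):
    """Return the sites seen in at least min_n of the per-normal callsets."""
    counts = Counter()
    for sites in callsets:
        counts.update(set(sites))  # count each site once per sample
    return sorted(site for site, n in counts.items() if n >= min_n)

# Hypothetical sites as (chrom, pos, ref, alt); not taken from real data.
normal1 = [("1", 1000, "A", "T"), ("1", 2000, "G", "C")]
normal2 = [("1", 1000, "A", "T"), ("2", 500, "C", "A")]
normal3 = [("2", 500, "C", "A"), ("1", 1000, "A", "T")]

pon = build_pon([normal1, normal2, normal3])
```

Here the site seen in a single normal is dropped and the two recurrent sites are kept, which is the behaviour the minimum-of-two setting is meant to give.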
Output
MuTect output has numerous columns; some of the most interesting are given below

MuTect columns of interest:
  context: some cancer types have mutation signatures
  t_lod_fstar: final adjusted tumor LOD score
  tumor_f: allele fraction of variant in tumor
  strand_bias_counts: counts of fwd/rev +/- reads
  observed_in_normals_count: evidence
  failure_reasons: internal filters applied
  judgement: final filter status
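
A short sketch of reading these columns follows. It rests on assumptions not stated on the slide: that the call_stats file is tab-delimited, may begin with '#' comment lines, and marks passing calls with KEEP in the judgement column. The contig and position columns and the example rows are hypothetical.

```python
import csv
import io

def passing_calls(handle):
    """Yield rows of a call_stats table whose judgement is KEEP.

    Assumptions: tab-delimited text, optional leading '#' comment lines,
    and "KEEP" as the passing value of the judgement column.
    """
    lines = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(lines, delimiter="\t"):
        if row["judgement"] == "KEEP":
            yield row

# Hypothetical two-row table using a few of the columns listed above.
example = io.StringIO(
    "## hypothetical header comment\n"
    "contig\tposition\ttumor_f\tt_lod_fstar\tfailure_reasons\tjudgement\n"
    "1\t1000\t0.20\t25.1\t\tKEEP\n"
    "1\t2000\t0.03\t6.5\tnormal_lod\tREJECT\n"
)
kept = list(passing_calls(example))
```

For a real file, pass an open file handle instead of the in-memory example; every value comes back as a string, so convert tumor_f and t_lod_fstar with float() before comparing them.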
Performance of the original MuTect (M1)
[Figure: benchmark panels from Cibulskis et al., Nat. Biotechnol. (2013). Sensitivity as a function of tumor sample sequencing depth (0 to 60x) and mutation allele fraction (f = 0.05, 0.1, 0.2, 0.4); somatic miscall error rate for true germ-line sites as a function of sequencing depth; false positive rate (per Mb) at true reference sites as a function of tumor sample sequencing depth, with the desired false positive rate as a dashed line; and sensitivity versus false positive rate (per Mb) for MuTect, SomaticSniper, JointSNVMix and Strelka in STD and HC configurations, plus MuTect HC + PON. Error bars, 95% confidence intervals.]
ROC curve at 30x, allele fraction = 0.1, using real data
MuTect excels at accurately detecting low allele fraction mutations, hence uniquely suited for studying impure and heterogeneous tumors
Cibulskis et al., Nat. Biotechnol. (2013)
Performance of the original MuTect (M1)
MuTect ranked highly in all 4 DREAM synthetic challenges:

Simulation                      Rank  Sensitivity  Precision  False Positive Rate   Balanced Accuracy
                                                              (mutations per Mb)
100% purity                     1st   0.967        0.984      0.021                 0.975
80% purity                      1st   0.961        0.992      0.010                 0.977
100% purity, 50%,33%,20% CCF    2nd   0.918        0.981      0.038                 0.949
80% purity, 50%,35% CCF         1st   0.741        0.983      0.051                 0.862

Since MuTect is available for public download, several other teams also submitted call sets using MuTect. In Challenge 4, 3 out of the top 4 teams used MuTect in their calling pipeline.
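
The Balanced Accuracy column is consistent with the mean of the Sensitivity and Precision columns. The sketch below checks that reading against the four rows; the formula is inferred from the numbers on the slide rather than taken from the challenge's definition.

```python
# (sensitivity, precision, reported balanced accuracy) for the four simulations.
rows = {
    "100% purity": (0.967, 0.984, 0.975),
    "80% purity": (0.961, 0.992, 0.977),
    "100% purity, 50%,33%,20% CCF": (0.918, 0.981, 0.949),
    "80% purity, 50%,35% CCF": (0.741, 0.983, 0.862),
}

def balanced_accuracy(sensitivity, precision):
    # Assumption: balanced accuracy here is the arithmetic mean of the two.
    return (sensitivity + precision) / 2

for name, (sens, prec, reported) in rows.items():
    computed = balanced_accuracy(sens, prec)
    print(f"{name}: computed {computed:.4f}, reported {reported:.3f}")
```

Each computed value agrees with the reported one to within rounding of the third decimal.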
PART 2:
SNEAK PREVIEW OF MUTECT 2
MuTect 2 sneak preview: it calls indels!
Example: 22bp insertion from DREAM challenge
BWA-MEM soft-clips the reads; this indel is definitely not called by Indelocator
MuTect 2 sneak preview: doing well in DREAM
Challenge 4: 80% Purity, 2 subclones (30%, 15% allele fraction)

Type    Method                   Specificity  Sensitivity  Accuracy (F1)
SNPs    MuTect-RSp2 (Winner)     98.26%       74.13%       0.8620
SNPs    M2 (Stock, MuTect PON)   97.49%       76.42%       0.8696
INDELs  NovoBreak (Winner)       92.75%       78.80%       0.8578
INDELs  Indelocator              54.94%       18.60%       0.3677
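
A note on the last column: the values labelled "Accuracy (F1)" match the arithmetic mean of the two preceding columns rather than their harmonic mean. The sketch below shows both calculations for the MuTect-RSp2 row, treating the first column as precision for the purpose of the F1 formula (an assumption, since the slide labels it specificity).

```python
def arithmetic_mean(a, b):
    return (a + b) / 2

def f1_score(precision, recall):
    # Standard F1: harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# MuTect-RSp2 row from the Challenge 4 table.
first_column, sensitivity, reported = 0.9826, 0.7413, 0.8620

mean_value = arithmetic_mean(first_column, sensitivity)
f1_value = f1_score(first_column, sensitivity)
print(f"mean: {mean_value:.4f}, F1: {f1_value:.4f}, reported: {reported:.4f}")
```

The same holds for the other rows, so the column is probably best read as a balanced accuracy.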
MuTect 2 design goals
Superior Unified SNP + Indel Calling for cancer analysis
  As good as MuTect on SNPs + best indel caller
  Elimination of Co-realignment step (expensive)
  Statistical filters preferred over hard filters
Support for new experimental data sets
  Differential Coverage of Tumor/Normal
  Multi-sample calling (Trios, Quads, CTCs)
Standardized VCF Output
Leverage Haplotype Caller technologies (assembler + PairHMM)
Local assembly is better than pileup
Compare germline local assembly methods (HaplotypeCaller, Platypus) to pileup-based methods (UnifiedGenotyper, SamTools)

Type    Method  WGS FDR  WGS Sensitivity  WEx FDR  WEx Sensitivity
SNPs    HC      0.12%    98.27%           0.16%    96.54%
SNPs    UG      0.11%    98.48%           0.19%    96.23%
INDELs  HC      0.81%    93.39%           1.33%    77.08%
INDELs  UG      5.68%    86.50%           0.01%    68.79%

(HC = HaplotypeCaller, UG = UnifiedGenotyper)
Remember how HaplotypeCaller works?
https://www.broadinstitute.org/gatk/guide/article?id=4148
This is how MuTect 2 works
Active Regions are identified using original MuTect somatic statistic, including indel events, with low threshold (LOD 4.0, similar to MuTect callstats threshold)

Reads are differentially filtered for tumor vs. normal
  Tumor is strict: MAPQ >= Q20, discarding discrepant overlapping fragments
  Normal is permissive: MAPQ >= Q0, keep alternate read from discrepant overlapping fragments
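
The differential read filter can be pictured as a per-sample threshold. This sketch covers only the mapping-quality part and reads Q20 and Q0 as minimum values, which is an assumption; the handling of discrepant overlapping fragments is not modelled, and the Read class is a hypothetical stand-in for an alignment record.

```python
from dataclasses import dataclass

@dataclass
class Read:
    name: str
    mapq: int
    is_tumor: bool

# Minimum MAPQ per sample type, read off the slide (assumed to be lower bounds).
MIN_MAPQ = {True: 20, False: 0}

def keep_read(read):
    """Strict for tumor reads, permissive for normal reads."""
    return read.mapq >= MIN_MAPQ[read.is_tumor]

reads = [
    Read("tumor_low", 10, True),
    Read("tumor_ok", 20, True),
    Read("normal_low", 10, False),
]
kept_names = [r.name for r in reads if keep_read(r)]
```

A MAPQ 10 read is dropped from the tumor but kept in the normal, so weak evidence still counts against calling an event somatic.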
Assembly + PairHMM are extremely similar to the Haplotype Caller
  Only high quality reads are used in the assembly
  Very minor technical changes which impact somatic calling because our events are rare and at low allele fraction (lower tolerance for losing reads)

Somatic Genotyping Engine is very similar to the MuTect calculation, but rather than using a likelihood based on base quality scores, we use the PairHMM likelihoods
New statistics available versus MuTect
Now that we're calling an entire region at once, we can see what you typically see in an IGV screenshot

New annotations
  Event Count (ECNT): # of events in the haplotype
  Min/Max Event Distance (MIN_ED/MAX_ED): min/max distance between events

Example: ECNT -> 6, MIN_ED -> 24bp, MAX_ED -> 131bp
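
One way to read these annotations is as simple summaries of the event positions on a haplotype. The sketch below assumes MIN_ED is the smallest gap between neighbouring events and MAX_ED the distance between the two farthest events; the slide does not spell out the exact definitions, and the positions are hypothetical values chosen to reproduce the example numbers.

```python
def event_annotations(positions):
    """Summarise event positions on one haplotype (assumed definitions)."""
    ordered = sorted(positions)
    ecnt = len(ordered)
    if ecnt < 2:
        # Distances are undefined with fewer than two events.
        return {"ECNT": ecnt, "MIN_ED": None, "MAX_ED": None}
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return {
        "ECNT": ecnt,
        "MIN_ED": min(gaps),                 # closest pair of events
        "MAX_ED": ordered[-1] - ordered[0],  # farthest pair of events
    }

# Hypothetical positions (bp) of six events in one active region.
annotations = event_annotations([100, 124, 150, 180, 205, 231])
print(annotations)  # {'ECNT': 6, 'MIN_ED': 24, 'MAX_ED': 131}
```

Many events packed closely together on one haplotype is the pattern a reviewer would notice in a read viewer, which is why these counts are useful as filter inputs.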

Performance of M2 in DREAM 3
Challenge 3: 100% Pure, 3 subclones (50%, 33%, 20% allele fraction)

Type    Method                  Specificity  Sensitivity  Accuracy (F1)
SNPs    WashU Viper (Winner)    98.995%      90.99%       0.94993
SNPs    MuTect - L630D8P2       98.093%      91.78%       0.94934
INDELs  WashU Pindel (Winner)   97.63%       87.49%       0.92556
INDELs  Indelocator             54.51%       41.96%       0.48236

Challenge 4: 80% Purity, 2 subclones (30%, 15% allele fraction)

Type    Method                  Specificity  Sensitivity  Accuracy (F1)
SNPs    MuTect-RSp2 (Winner)    98.26%       74.13%       0.8620
INDELs  NovoBreak (Winner)      92.75%       78.80%       0.8578
INDELs  Indelocator             54.94%       18.60%       0.3677

Challenge 5 (running): 80% Purity, Novoalign, 25% allele fraction(?)

Somatic Variant Discovery Workflow
Indels coming soon! (M2)
+ some post-processing to rescue TiN variants and eliminate artifacts
Further reading
Documentation coming soon to the GATK website

In the meantime, see
http://www.broadinstitute.org/cancer/cga/Home
