
Calling variants with MuTect

SNPs today, Indels tomorrow


Somatic Variant Discovery Workflow

Indels coming soon! (M2)

+ some post-processing to rescue TiN variants and eliminate artifacts
Challenge: deal with difficult-to-predict allelic fraction

[Figure: a sample of three cells, one normal (N) and two tumor (T): 33% N, 67% T]

(1) Purity = 67%


(2) Local copy number in tumor = 4
(3) Number of mutated copies per cancer cell = 1

-> Allelic fraction = 2/10 = 0.2
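
The arithmetic on this slide can be written as a small function. This is a minimal sketch, assuming the normal cells are diploid (copy number 2) at the locus and carry no mutated copies; the function and parameter names are illustrative, not part of MuTect.

```python
def expected_allele_fraction(purity, tumor_copy_number, mutated_copies,
                             normal_copy_number=2):
    """Expected allelic fraction of a somatic variant in a tumor/normal mixture.

    Assumption: normal cells contribute `normal_copy_number` copies of the
    locus and none of them carry the mutation.
    """
    mutant = purity * mutated_copies
    total = purity * tumor_copy_number + (1.0 - purity) * normal_copy_number
    return mutant / total


# Slide example: purity 67% (2 of 3 cells), tumor copy number 4,
# one mutated copy per cancer cell.
af = expected_allele_fraction(purity=2 / 3, tumor_copy_number=4, mutated_copies=1)
print(round(af, 3))
```

With the slide's numbers, two mutated copies sit among ten total copies (8 from the two tumor cells, 2 from the normal cell), which gives 0.2.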
Remember T/N pair comparisons

TUMOR
NORMAL
PART 1:
THE ORIGINAL MUTECT
MuTect: A brief history

MuTect was born as muTector back in 2009



Published in Nature Biotech, 2013


Core statistical model is a Bayesian classifier, for both
detection of a point variant in the tumor and classification
of the event as somatic


Pragmatic hard filters used to control false positive rate
MuTect method overview

[Figure: MuTect method overview. Tumor and normal reads pass through read filters. A variant detection statistic is computed on the tumor (STD callset). Candidate sites then pass through site-based variant filters (proximal gap, strand bias, poor mapping, triallelic site, clustered position, observed in control) to give the HC callset, and a panel of normal samples filter gives the HC + PON callset. Variant classification uses the matched normal and dbSNP to produce candidate somatic mutations.]

Variant detection statistic:

\log_{10} \frac{L[M_f^m]\, P(m,f)}{L[M_0]\, (1 - P(m,f))} \ge \log_{10} \delta_T

Cibulskis et al., Nat. Biotechnol. (2013)
Core variant detection algorithm

MuTect was based on the GATK's UnifiedGenotyper
(locus-based caller, but without the ploidy expectation)

For each position in the input intervals:
  For each possible alternate allele:
    Estimate the allele fraction from the tumor pileup
    Calculate hom-ref and het likelihoods
    Report tumor LOD as log-odds ratio of het to hom-ref

f is the allele fraction estimate
r is the reference allele
m is the variant (mutant) allele
e_i is the error rate at base b_i
e_i/3 is the rate at which an error in the ref matches the alt m
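
The per-site calculation can be sketched as follows. This is an illustration under stated assumptions, not MuTect's code: it uses the per-base model described in Cibulskis et al. (2013), where a read base matches the reference, the mutant allele, or neither, and it compares the likelihood at the estimated allele fraction against the reference-only model (f = 0). The function names and the toy pileup are hypothetical.

```python
import math


def log10_likelihood(bases, error_rates, ref, alt, f):
    """log10 likelihood of a pileup if `alt` is present at allele fraction f."""
    total = 0.0
    for b, e in zip(bases, error_rates):
        if b == ref:
            # true ref read correctly, or true alt misread as ref
            p = f * (e / 3.0) + (1.0 - f) * (1.0 - e)
        elif b == alt:
            # true alt read correctly, or true ref misread as alt
            p = f * (1.0 - e) + (1.0 - f) * (e / 3.0)
        else:
            # any other base can only be a sequencing error
            p = e / 3.0
        total += math.log10(p)
    return total


def tumor_lod(bases, quals, ref, alt):
    """Return (estimated allele fraction, log-odds of variant vs. reference-only)."""
    error_rates = [10 ** (-q / 10.0) for q in quals]  # Phred score -> error rate
    f_hat = sum(b == alt for b in bases) / len(bases)
    lod = (log10_likelihood(bases, error_rates, ref, alt, f_hat)
           - log10_likelihood(bases, error_rates, ref, alt, 0.0))
    return f_hat, lod


# Toy pileup: 30 reads at Q30, 3 of them supporting the alternate allele.
bases = ["A"] * 27 + ["T"] * 3
quals = [30] * 30
f_hat, lod = tumor_lod(bases, quals, ref="A", alt="T")
print(f_hat, round(lod, 2))
```

The sketch assumes a non-empty pileup and omits the prior term and the read and site filters that the real caller applies.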


Internal filters

Filters are internal to MuTect: no VQSR yet!


[Figure: the MuTect method overview from Cibulskis et al., highlighting the six site-based variant filters that produce the HC (high confidence) callset: proximal gap, strand bias, poor mapping, triallelic site, clustered position, observed in control (normal). A panel of normals (PON) filter is then used to screen out false positives caused by rare error modes seen in additional samples.]
Resources

Panel of Normals (PON) = blacklist


dbSNP = blacklist (less trusted than PON)
COSMIC = whitelist
Application of the Panel of Normals (PoN)
Would you consider this a good somatic variant candidate, based on the tumor and
matched normal pairing?

Tumor Matched Normal


(alt: 7, tot: 100) (tot: 81)
Application of the Panel of Normals (PoN)
What if 968 samples in a panel of ~8500 normals exhibit similar low AF alternate alleles
and low to moderate levels of noise at and around the locus? Would you still consider
this a somatic variant, or simply an error prone site?

Tumor Matched Normal PoN member 1 PoN member 2


(alt: 7, tot: 100) (tot: 81) (alt: 6, tot: 128) (alt: 7, tot: 120)

PoN member 3 PoN member 4 PoN member 5 PoN member 6:


(alt: 3, tot: 61) (alt: 3, tot: 140) (alt: 8, tot: 234) (alt: 3, tot: 67)
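
The counts on this slide can be turned into allele fractions to make the comparison concrete. This is a minimal sketch using only the numbers shown above; the panel size is given as "~8500" on the slide, so 8500 is treated as an approximation.

```python
def allele_fraction(alt, tot):
    """Fraction of reads supporting the alternate allele."""
    return alt / tot


tumor_af = allele_fraction(7, 100)

# (alt, tot) for the six PoN members shown on the slide
pon_members = [(6, 128), (7, 120), (3, 61), (3, 140), (8, 234), (3, 67)]
pon_afs = [allele_fraction(alt, tot) for alt, tot in pon_members]

# Share of the panel showing a similar low-AF allele (8500 is approximate)
pon_hit_fraction = 968 / 8500

print(f"tumor AF: {tumor_af:.3f}")
print("PoN AFs:", [f"{af:.3f}" for af in pon_afs])
print(f"panel members affected: {pon_hit_fraction:.1%}")
```

A tumor allele fraction of 0.07 is within the same low range as the normals, and roughly one in nine panel members shows the same signal, which points to an error-prone site rather than a somatic event.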
Application of the Panel of Normals (PoN)
What about this one? Does this look like a good event?

Tumor Matched Normal


(alt: 4, tot: 49) (tot: 65)
Application of the Panel of Normals (PoN)
11/~8500 PoN members display a similar low AF allele near what appears to be an
upstream germline variant.

Tumor Matched Normal


(alt: 4, tot: 49) (tot: 65)

PoN member 1: PoN member 2:


(alt: 10, tot: 39) (alt: 8, tot: 36)
How to run MuTect

MuTect shares some code with the GATK, but is


currently built into its own jar
Version numbers may vary

java -jar mutect-1.1.7.jar -T MuTect \
    -R human.fasta \
    -I:normal normal.bam \
    -I:tumor tumor.bam \
    --dbsnp dbsnp137.vcf \
    --cosmic cosmic.vcf \
    [ -L exome_targets.intervals \ ]
    -o sample.call_stats.txt \
    --coverage_file sample.coverage.wig.txt

How to run MuTect without a matched normal

MuTect can be run without a matched normal, but


germline SNPs will be called
Panel of normals will help

java -jar mutect-1.1.7.jar -T MuTect \
    -R human.fasta \
    -I:tumor tumor.bam \
    --normal_panel PoN.vcf \
    --dbsnp dbsnp137.vcf \
    --cosmic cosmic.vcf \
    [ -L exome_targets.intervals \ ]
    -o sample.call_stats.txt \
    --coverage_file sample.coverage.wig.txt

Making a PON

First we run MuTect on a set of normals to detect


common errors that appear as variants
Then we combine the normal callsets and retain
variants called in at least two samples
For each normal sample:
java -jar mutect-1.1.7.jar -T MuTect \
    -R human.fasta \
    -I:tumor normal.bam \
    --artifact_detection_mode \
    --dbsnp dbsnp137.vcf \
    --cosmic cosmic.vcf \
    [ -L exome_targets.intervals \ ]
    -vcf normal1.call_stats.vcf \
    --coverage_file normal1.coverage.wig.txt

Making a PON

Then combine the per-normal callsets, retaining variants called in at least two samples:
java -jar GenomeAnalysisTK.jar -T CombineVariants \
    -R human.fasta \
    -V normal1.call_stats.vcf \
    -V normal2.call_stats.vcf \
    -V ... \
    -minN 2 \
    --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED \
    --filteredAreUncalled \
    [ -L exome_targets.intervals \ ]
    -o PoN.vcf
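
The "called in at least two samples" rule is a simple counting step. The sketch below shows the idea on in-memory data; it is not how CombineVariants is implemented, it ignores the handling of filtered records, and the site tuples and function name are hypothetical.

```python
from collections import Counter


def build_pon(callsets, min_n=2):
    """Keep sites that appear in at least `min_n` of the per-normal callsets.

    Each callset is an iterable of (contig, position, ref, alt) tuples.
    """
    counts = Counter()
    for calls in callsets:
        counts.update(set(calls))  # count each site once per sample
    return sorted(site for site, n in counts.items() if n >= min_n)


# Hypothetical calls from three normal samples
normal1 = [("chr1", 1000, "A", "T"), ("chr2", 500, "G", "C")]
normal2 = [("chr1", 1000, "A", "T"), ("chr3", 42, "C", "A")]
normal3 = [("chr2", 500, "G", "C"), ("chr1", 1000, "A", "T")]

pon = build_pon([normal1, normal2, normal3], min_n=2)
print(pon)
```

The site seen in only one normal (chr3:42) is dropped; the two recurrent sites are kept as PoN entries.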
Output

MuTect output has numerous columns


Some of the most interesting are given below

...
Output

MuTect columns of interest:


Context: some cancer types have mutation signatures
t_lod_fstar: final adjusted tumor LOD score
tumor_f: allele fraction of variant in tumor
strand_bias_counts: counts of fwd/rev +/- reads
observed_in_normals_count: evidence
failure_reasons: internal filters applied
judgement: final filter status
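
A call-stats file can be filtered on these columns with the standard library. This is a minimal sketch, assuming a tab-delimited file with a header row, comment lines starting with "#", and a judgement column whose passing value is KEEP; the sample rows below are made up for illustration and show only a few columns.

```python
import csv
import io

# Hypothetical excerpt of a call-stats file (values are invented)
SAMPLE = (
    "## comment line\n"
    "contig\tposition\tref_allele\talt_allele\ttumor_f\tt_lod_fstar\tfailure_reasons\tjudgement\n"
    "1\t12345\tA\tT\t0.21\t18.4\t\tKEEP\n"
    "1\t67890\tG\tC\t0.04\t5.1\tfstar_tumor_lod\tREJECT\n"
    "2\t2468\tC\tA\t0.12\t9.7\t\tKEEP\n"
)


def read_kept_calls(handle):
    """Return the rows whose judgement is KEEP, skipping comment lines."""
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return [row for row in reader if row["judgement"] == "KEEP"]


kept = read_kept_calls(io.StringIO(SAMPLE))
for row in kept:
    print(row["contig"], row["position"], row["tumor_f"], row["t_lod_fstar"])
```

For a real file, replace the `io.StringIO` object with `open("sample.call_stats.txt")`.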
Performance of the original MuTect (M1)
[Figure: sensitivity as a function of tumor sample sequencing depth and mutation allele fraction (f = 0.4, 0.2, 0.1, 0.05); false positive rate (per Mb) at reference sites as a function of tumor sample sequencing depth; and sensitivity versus false positive rate for MuTect, SomaticSniper, JointSNVMix and Strelka in STD and HC configurations, plus MuTect HC + PON.]

ROC curve at 30x, allele fraction = 0.1, using real data

MuTect excels at accurately detecting low allele fraction mutations, hence uniquely suited for studying impure and heterogeneous tumors

Cibulskis et al., Nat. Biotechnol. (2013)
Performance of the original MuTect (M1)

MuTect ranked highly in all 4 DREAM synthetic challenges:

Simulation                      Rank  Sensitivity  Precision  False Positive Rate   Balanced
                                                              (mutations per Mb)    Accuracy
100% purity                     1st   0.967        0.984      0.021                 0.975
80% purity                      1st   0.961        0.992      0.010                 0.977
100% purity, 50%,33%,20% CCF    2nd   0.918        0.981      0.038                 0.949
80% purity, 50%,35%, CCF        1st   0.741        0.983      0.051                 0.862
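
The balanced accuracy column in this table is consistent with the mean of sensitivity and precision. The check below is a sketch based on that reading of the numbers; the slide itself does not state the formula, and the row labels are shortened for illustration.

```python
# (sensitivity, precision, reported balanced accuracy) from the table above
results = {
    "100% purity": (0.967, 0.984, 0.975),
    "80% purity": (0.961, 0.992, 0.977),
    "100% purity, subclonal": (0.918, 0.981, 0.949),
    "80% purity, subclonal": (0.741, 0.983, 0.862),
}


def balanced_accuracy(sensitivity, precision):
    """Assumed definition: arithmetic mean of sensitivity and precision."""
    return (sensitivity + precision) / 2.0


computed = {name: balanced_accuracy(s, p) for name, (s, p, _) in results.items()}
for name, (s, p, reported) in results.items():
    print(f"{name}: computed {computed[name]:.4f}, reported {reported}")
```

Each computed value matches the reported one to within rounding of the third decimal place.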

Since MuTect is available for public download, several other teams also submitted call sets using MuTect. In Challenge 4, 3 out of the top 4 teams used MuTect in their calling pipeline.
PART 2:
SNEAK PREVIEW OF MUTECT 2
MuTect 2 sneak preview: it calls indels!

Example: 22bp insertion from DREAM challenge

BWA-MEM soft-clips the reads; this indel is definitely not called by Indelocator
MuTect 2 sneak preview: doing well in DREAM

Challenge 4: 80% Purity, 2 subclones (30%, 15% allele fraction)

SNPs
Method                    Specificity  Sensitivity  Accuracy (F1)
MuTect-RSp2 (Winner)      98.26%       74.13%       0.8620
M2 (Stock, MuTect PON)    97.49%       76.42%       0.8696

INDELs
Method                    Specificity  Sensitivity  Accuracy (F1)
NovoBreak (Winner)        92.75%       78.80%       0.8578
Indelocator               54.94%       18.60%       0.3677
M2 (Stock, MuTect PON)    97.06%       77.52%       0.8729
MuTect 2 design goals

Superior Unified SNP + Indel Calling for cancer analysis

As good as MuTect on SNPs + best indel caller
Elimination of Co-realignment step (expensive)
Statistical filters preferred over hard filters
Support for new experimental data sets
Differential Coverage of Tumor/Normal
Multi-sample calling (Trios, Quads, CTCs)
Standardized VCF Output
Leverage Haplotype Caller technologies (assembler + PairHMM)
Local assembly is better than pileup

Compare:
germline local assembly methods (HaplotypeCaller, Platypus)
to pileup-based methods (UnifiedGenotyper, SamTools)

SNPs
Method  WGS FDR  WGS Sensitivity  WEx FDR  WEx Sensitivity
HC      0.12%    98.27%           0.16%    96.54%
UG      0.11%    98.48%           0.19%    96.23%

INDELs
Method  WGS FDR  WGS Sensitivity  WEx FDR  WEx Sensitivity
HC      0.81%    93.39%           1.33%    77.08%
UG      5.68%    86.50%           0.01%    68.79%
Remember how HaplotypeCaller works?

https://www.broadinstitute.org/gatk/guide/article?id=4148
This is how MuTect 2 works

Active Regions are identified
using original MuTect somatic
statistic, including indel events,
with low threshold (LOD 4.0,
similar to MuTect callstats
threshold)

Reads are differentially filtered for tumor vs. normal

Tumor is strict: MAPQ ≥ Q20, discarding discrepant
overlapping fragments
Normal is permissive: MAPQ ≥ Q0, keep alternate
read from discrepant overlapping fragments
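
The differential read filtering can be sketched as two small rules. This is an interpretation of the slide, not MuTect 2 source code: the thresholds come from the text above, while the per-position treatment of overlapping mates, the function names, and the choice of which base to keep in the normal are assumptions made for illustration.

```python
# Minimum mapping quality per sample type, as stated on the slide
MIN_MAPQ = {"tumor": 20, "normal": 0}


def passes_mapq(mapq, sample):
    """True if a read's mapping quality meets the threshold for its sample."""
    return mapq >= MIN_MAPQ[sample]


def resolve_overlap(base1, base2, ref, sample):
    """Bases kept at one position covered by both mates of a fragment.

    Assumed behavior: concordant mates count once; discrepant mates are
    dropped in the tumor, while the normal keeps the non-reference base.
    """
    if base1 == base2:
        return [base1]
    if sample == "tumor":
        return []
    return [b for b in (base1, base2) if b != ref][:1]


print(passes_mapq(10, "tumor"), passes_mapq(10, "normal"))
print(resolve_overlap("A", "T", ref="A", sample="tumor"))
print(resolve_overlap("A", "T", ref="A", sample="normal"))
```

The asymmetry is deliberate: being strict in the tumor limits false positives, and being permissive in the normal makes it harder for weak germline or artifact evidence to go unnoticed.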
This is how MuTect 2 works

Assembly + PairHMM are


extremely similar to the
Haplotype Caller

Only high quality reads are


used in the assembly

Very minor technical changes
which impact somatic calling
because our events are rare and
at low allele fraction (lower
tolerance for losing reads)
This is how MuTect 2 works

Somatic Genotyping Engine is
very similar to the MuTect
calculation, but rather than
using a likelihood based on base
quality scores, we use the
PairHMM Likelihoods

New statistics available versus MuTect

Now that we're calling an entire region at once, we can
see what you typically see in an IGV screenshot
New annotations

Event Count (ECNT): # of events in the haplotype

Min/Max Event Distance (MIN_ED/MAX_ED):
min/max distance between events

ECNT -> 6, MIN_ED -> 24bp, MAX_ED -> 131bp
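
These annotations are straightforward to compute from the event positions on a haplotype. The sketch below assumes distances are measured between event start positions over all pairs of events; the positions are hypothetical, chosen to reproduce the numbers in the example above.

```python
from itertools import combinations


def event_annotations(positions):
    """Compute ECNT, MIN_ED and MAX_ED from event start positions."""
    ecnt = len(positions)
    if ecnt < 2:
        # distances are undefined with fewer than two events
        return {"ECNT": ecnt, "MIN_ED": None, "MAX_ED": None}
    distances = [abs(a - b) for a, b in combinations(positions, 2)]
    return {"ECNT": ecnt, "MIN_ED": min(distances), "MAX_ED": max(distances)}


# Hypothetical positions of six events on one haplotype
annotations = event_annotations([100, 124, 150, 180, 205, 231])
print(annotations)
```

A high event count with small distances between events is the kind of cluster that often looks suspicious in an IGV screenshot.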


Performance of M2 in DREAM 3

Challenge 3: 100% Pure, 3 subclones (50%, 33%, 20% allele fraction)

SNPs
Method                    Specificity  Sensitivity  Accuracy (F1)
WashU Viper (Winner)      98.995%      90.99%       0.94993
MuTect - L630D8P2         98.093%      91.78%       0.94934
M2 (Stock, MuTect PON)    95.200%      93.06%       0.94125

INDELs
Method                    Specificity  Sensitivity  Accuracy (F1)
WashU Pindel (Winner)     97.63%       87.49%       0.92556
Indelocator               54.51%       41.96%       0.48236
M2 (Stock, MuTect PON)    91.20%       90.99%       0.91093
Performance of M2 in DREAM 4

Challenge 4: 80% Purity, 2 subclones (30%, 15% allele fraction)

SNPs
Method                    Specificity  Sensitivity  Accuracy (F1)
MuTect-RSp2 (Winner)      98.26%       74.13%       0.8620
M2 (Stock, MuTect PON)    97.49%       76.42%       0.8696

INDELs
Method                    Specificity  Sensitivity  Accuracy (F1)
NovoBreak (Winner)        92.75%       78.80%       0.8578
Indelocator               54.94%       18.60%       0.3677
M2 (Stock, MuTect PON)    97.06%       77.52%       0.8729
Performance of M2 in DREAM 5

Challenge 5 (running): 80% Purity, Novoalign, 25% allele fraction(?)



Further reading
Documentation coming soon to the GATK website

In the meantime, see
http://www.broadinstitute.org/cancer/cga/Home
