
talks

Estimating cross-sample contamination with ContEst
Somatic Variant Discovery Workflow

Indels coming soon! (M2)

+ some post-processing to rescue TiN variants and eliminate artifacts
Would you trust a variant call made at this site?
Disambiguating types of contamination

Cross-sample (different people)

Tumor <-> normal (different tissue)
Tumor subclones (different cell lines)
Bacterial cells (esp. in saliva, cheek swabs)

[Figure: tumor cells and normal cells in a sample; legend: Normal, Tumor, Other contaminating cells]


ContEst: (cross-sample) Contamination Estimation

Here contamination = cells from other samples

Method described in Cibulskis et al., 2011
bioinformatics.oxfordjournals.org/content/27/18/2601

ContEst is not intended to determine stromal contamination (the number of normal cells in your tumor sequence)

Stromal contamination is estimated in post-processing using a tool called ABSOLUTE by Carter et al.,
www.nature.com/nbt/journal/v30/n5/abs/nbt.2203.html
ContEst method in a nutshell

Evaluate genotypes of your sample at a set of sites that are expected to be homozygous-variant

[Figure: your sample alongside the contaminating population of samples]
Wait, how do I know which sites are hom-var?

Genotyping array

Array-free, by on-the-fly genotyping
Contamination estimation with a genotyping array

Select sites that are HOM-VAR in the array data

Any REF at those sites = probably contamination
Contamination estimation with on-the-fly genotyping

Identify HOM-VAR sites by genotyping the matched normal (preferred) or tumor (if unmatched)

Call HOM-VAR any site with > 80% bases showing ALT with at least 50X coverage

Any REF at those sites = probably contamination
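
The HOM-VAR rule above can be sketched in a few lines of Python. This is only an illustration of the thresholds on the slide (more than 80% ALT bases, at least 50X coverage), not ContEst's own code; the function name, the pileup tuples, and the site labels are hypothetical.

```python
def is_hom_var(alt_count: int, depth: int,
               min_alt_fraction: float = 0.80, min_depth: int = 50) -> bool:
    """Apply the slide's rule: > 80% ALT bases with at least 50X coverage."""
    if depth < min_depth:
        return False  # not enough coverage to trust the genotype
    return alt_count / depth > min_alt_fraction


# Hypothetical per-site counts: (site label, ALT bases, total depth).
pileup = [
    ("chr1:1000", 58, 60),  # high ALT fraction, deep enough
    ("chr1:2000", 30, 60),  # looks heterozygous
    ("chr2:500", 20, 20),   # all ALT, but coverage too low
]

hom_var_sites = [site for site, alt, depth in pileup if is_hom_var(alt, depth)]
print(hom_var_sites)
```

Only the first site passes both thresholds; REF reads seen at such sites are the ones counted as likely contamination.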
Population allele frequency matters

Population allele frequency is important too: the mismatching reads only reflect part of the true total contamination

[Figure: your sample alongside the contaminating population of samples]
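
A rough way to see why allele frequency matters: at a HOM-VAR site, a contaminating read only shows REF when the contaminant carries REF, which happens with probability 1 minus the population ALT frequency. The sketch below ignores sequencing error and is a simplification for illustration, not the ContEst estimator; the function name and the numbers are hypothetical.

```python
def correct_for_allele_frequency(ref_fraction: float, pop_alt_freq: float) -> float:
    """Scale the observed REF read fraction up to a total contamination estimate.

    Assumes no sequencing error: expected REF fraction = c * (1 - pop_alt_freq).
    """
    if not 0.0 <= pop_alt_freq < 1.0:
        raise ValueError("pop_alt_freq must be in [0, 1)")
    return ref_fraction / (1.0 - pop_alt_freq)


# Hypothetical example: 1% of reads show REF at a site where half of the
# population alleles are ALT, so about half of the contaminating reads are hidden.
estimate = correct_for_allele_frequency(ref_fraction=0.01, pop_alt_freq=0.5)
print(f"{estimate:.3f}")
```

Under these assumptions the 1% of mismatching reads corresponds to roughly 2% total contamination.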
The underlying algorithm

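c: contamination
f: minor allele frequency
e: sequencing error rate

Bayesian approach to calculate the posterior probability of the contamination level and determine the maximum a posteriori probability (MAP) estimate of the contamination level

[Figure: probability tree. A read comes from your sample with probability 1-c or from a contaminant with probability c. A contaminating read carries the minor allele with probability f or the major allele with probability 1-f. Reads from your sample are observed as MINOR with probability 1-e and MAJOR with probability e; contaminating minor-allele reads as MINOR with 1-e and MAJOR with e; contaminating major-allele reads as MINOR with e and MAJOR with 1-e.]

P(MINOR | genotype) = (1-c)(1-e) + cf(1-e) + c(1-f)(e)
P(MAJOR | genotype) = (1-c)(e) + cf(e) + c(1-f)(1-e)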
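
The two probabilities above can be turned into a small MAP search. This is a minimal sketch, not ContEst's implementation: it assumes a uniform prior on c (so the MAP estimate coincides with the maximum-likelihood value), treats reads as independent, and uses a plain grid search. The site tuples, the default error rate, and the function names are hypothetical.

```python
import math


def p_minor(c: float, f: float, e: float) -> float:
    """P(MINOR | genotype) as written on the slide."""
    return (1 - c) * (1 - e) + c * f * (1 - e) + c * (1 - f) * e


def p_major(c: float, f: float, e: float) -> float:
    """P(MAJOR | genotype) as written on the slide."""
    return (1 - c) * e + c * f * e + c * (1 - f) * (1 - e)


def log_likelihood(c: float, sites, e: float) -> float:
    """Sum of per-read log probabilities over all sites."""
    total = 0.0
    for minor_reads, major_reads, f in sites:
        total += minor_reads * math.log(p_minor(c, f, e))
        total += major_reads * math.log(p_major(c, f, e))
    return total


def map_contamination(sites, e: float = 0.001, steps: int = 1000) -> float:
    """Grid search over c in [0, 1]; with a uniform prior this is the MAP value."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda c: log_likelihood(c, sites, e))


# Hypothetical sites: (MINOR read count, MAJOR read count, minor allele frequency f).
sites = [(98, 2, 0.5), (97, 3, 0.4), (99, 1, 0.7)]
print(map_contamination(sites))
```

The error rate e must be strictly between 0 and 1 so that every logarithm is defined.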
How to run it


java -jar ContEst.jar \
-T Contamination \
-R reference.fasta \
-I sample.bam \
-B:pop,vcf population_stratified_af_hapmap.vcf \
-B:genotypes,vcf normal_sample.vcf \
-BTI genotypes \
-o contamination_results.txt

Contamination estimation for the sample overall (used by MuTect in next step)

Contamination for each lane in the sample (by read group; can blacklist RGs): add -llc to your command line

[Screenshot of output; residue: XXXXXXX, LANE]
How to interpret the contamination values

0-2% - Fine, everything is good!

2-5% - Slightly contaminated, might be worth looking into if your sample produces weird downstream results

5-15% - Heavily contaminated but salvageable; watch these samples, and expect much manual review

15-50% - Heavily contaminated; most likely worth removing the samples and following up with project management

>50% - Unusable contamination; as you approach 100% contamination there's a chance it's a sample swap
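
The ranges above can be applied as a simple triage step. This is a sketch only: the function name and labels are hypothetical, and because the slide's ranges share their endpoints, assigning each boundary value (2, 5, 15, 50) to the lower band is an assumption.

```python
def interpret_contamination(percent: float) -> str:
    """Map a contamination estimate (in percent) to the slide's categories."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    if percent <= 2:
        return "fine"
    if percent <= 5:
        return "slightly contaminated"
    if percent <= 15:
        return "heavily contaminated but salvageable"
    if percent <= 50:
        return "heavily contaminated, consider removing"
    return "unusable, possible sample swap near 100%"


for value in (1.0, 3.5, 10.0, 30.0, 80.0):
    print(value, "->", interpret_contamination(value))
```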
Somatic Variant Discovery Workflow

Indels coming soon! (M2)

+ some post-processing to rescue TiN variants and eliminate artifacts

Further reading
Documentation coming soon to the GATK website

In the meantime, see
http://www.broadinstitute.org/cancer/cga/Home
