
Cancer Genetics 206 (2014) 441–448

REVIEW ARTICLE
Understanding the limitations of next generation
sequencing informatics, an approach to clinical
pipeline validation using artificial data sets
Robert Daber*, Shrey Sukhadia, Jennifer J.D. Morrissette
Center for Personalized Diagnostics, University of Pennsylvania School of Medicine, Philadelphia, PA

The advantages of massively parallel sequencing are quickly being realized through the adoption of comprehensive genomic panels across the spectrum of genetic testing. Despite such widespread utilization of next generation sequencing (NGS), a major bottleneck in the implementation and capitalization of this technology remains in the data processing steps, or bioinformatics. Here we describe our approach to defining the limitations of each step in the data processing pipeline by utilizing artificial amplicon data sets to simulate a wide spectrum of genomic alterations. Through this process, we identified limitations of insertion, deletion (indel), and single nucleotide variant (SNV) detection using standard approaches and described novel strategies to improve overall somatic mutation detection. Using these artificial data sets, we were able to demonstrate that NGS assays can have robust mutation detection if the data can be processed in a way that does not lead to large genomic alterations landing in the unmapped data (i.e., trash). By using these pipeline modifications and a new variant caller, AbsoluteVar, we have been able to validate SNV mutation detection to 100% sensitivity and specificity with an allele frequency as low as 4% and detection of indels as large as 90 bp. Clinical validation of NGS relies on the ability for mutation detection across a wide array of genetic anomalies, and the utility of artificial data sets demonstrates a mechanism to intelligently test a vast array of mutation types.
Keywords Next generation sequencing, bioinformatics, validation, sensitivity, artificial data set
© 2014 Elsevier Inc. All rights reserved.

Rapid advancements in next generation sequencing (NGS) have opened the door for unprecedented diagnostic capabilities (1–4). While massively parallel sequencing makes it feasible to rapidly sequence large genomic regions, the overall utility is limited by the least dependable or reproducible steps in the assay (5,6). Although perceived as a single test, an NGS assay is composed of three distinct, yet interdependent modules consisting of library preparation, sequencing, and bioinformatic analysis (Figure 1A). The methods used for each of these modules have intrinsic limitations, which critically impact the overall sensitivity and specificity of a given test.

Most NGS validation strategies involve comparison of mutations previously characterized by standard molecular techniques to determine variant detection concordance in overlapping genes, followed by confirmation of the additional variants identified in the NGS data with an alternative platform (1,7). This is a common approach, due to clinical expertise in these technologies and the availability of specimens with characterized mutations. This approach has limited utility in oncology specimens, where tumor heterogeneity may result in allelic burdens below 15–20%. In addition, methods such as Sanger sequencing confirmation, often considered the "gold standard," are inadequate due to innate sensitivity limitations (8). The aforementioned validation strategies also do not address false-negative detection in all of the genomic positions that are included in the panel, yet not fully interrogated by an alternative methodology. Specifically, if a mutation, such as a large insertion, was not previously characterized by an alternative method, failure of the NGS assay to detect this mutation would be unknown, since only NGS-identified variants in novel regions are further confirmed.

Received October 1, 2013; received in revised form November 19, 2013; accepted November 20, 2013.
* Corresponding author.
E-mail address: Robert.Daber@uphs.upenn.edu

2210-7762/$ - see front matter © 2014 Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.cancergen.2013.11.005

One solution is a validation strategy that utilizes an alternative NGS methodology with overlapping coverage to confirm false-positive and -negative results across all genes within the panel and to confirm variants seen in this lower allelic burden range. This approach is useful to address false variant calls that may result from either the library preparation or sequencing methodology; however, most NGS assays share similar informatics pipelines. This creates a situation where variant-positive and -negative sites may be concordant between two different NGS approaches, yet may collectively represent false calls due to limitations in the informatics alone. A classic example of this limitation is in the detection of intermediate-sized insertions or deletions (indels) (9,10). This limitation represents a significant deficiency in the validation of NGS, since overall assay limits of detection (LOD) based upon indel size cannot be adequately addressed. The severity of this limitation is also amplified in many clinical labs where the expertise of the informatics groups and the clinical groups are mutually exclusive, often resulting in frustrating and unsuccessful communication surrounding this bottleneck.

To bridge the gap between the expertise of these groups and to demonstrate the impact each processing task has on the raw data, we broke down each step in the informatics pipeline to validate the informatics approach of our NGS assay by defining the relationship between each processing step and the sensitivity and specificity of variant detection. We first created artificial data sets that modeled a range of mutation types. These included single nucleotide variants (SNVs), insertions, and deletions over a wide range of genomic loci and at defined allelic frequencies ranging from ~1–50%, to determine the detection limitations for the bioinformatic software solutions utilized. Since every read generated from the sequencer has a unique read name, we were able to monitor the fate of every read during pipeline processing to determine at which steps artificial variant reads were either improperly mapped, filtered out for quality, or simply not identified during the variant calling process. This approach allowed us to investigate the advantages and limitations of various bioinformatic tools, different algorithms, algorithm input parameters, and overall algorithm configurations. Using this knowledge, we were then able to construct novel algorithms and pipeline configurations to maximize overall variant detection performance and to define the actual sensitivities and specificities of our analysis approach.

Materials and methods

Data from one de-identified data set was used to create artificial fastq files used for validation. These data were generated from a custom amplicon panel containing 382 amplicons spread over 33 genes. Target enrichment was performed using a customized panel of amplicons (TruSeq Custom Amplicon; Illumina, San Diego, CA). As per the manufacturer's protocol, a library was created and subsequently sequenced on the Illumina MiSeq with 150 bp paired-end reads to an average depth of coverage of 2,700X. After de-multiplexing the data for each sample, the fastq files were processed through our laboratory's informatics pipeline to identify variants present in the normal data (Figure 1B).

Artificial variant files were generated by searching the fastq files for specific reference sequences (Supplemental Tables 1–3) and replacing the sequence text string with a sequence containing one of the artificial variants. The locations and mutations were selected at random, with the requirement that they needed to impact a range of different loci. For reads that received artificial insertions, each read was subsequently trimmed back to 150 bp after inserting the artificial sequence. To simulate variant allele frequencies ranging from ~1–50%, different variant files were created by counting the total number of matching target reads and then randomly replacing the wild-type sequence with the artificial sequence in the specified percent of total reads. The method utilized replacement of sequence based upon a 100% match strategy; therefore, since random errors exist in the data, we also had to count the number of times a sequence was replaced to get a final absolute artificial allele frequency. In all situations, the actual artificial variant allele frequency was lower than the targeted percentage.
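To make the read-replacement procedure concrete, the sketch below shows one way the spiking just described could be implemented. It is a minimal illustration, not the authors' actual scripts: the function name, the exact-match replacement, the fixed 150 bp trim, and the probabilistic stand-in for the count-then-replace step follow the description above, while the file handling and random seed are assumptions.

```python
import random

def spike_variant(fastq_in, fastq_out, wt_seq, mut_seq, target_fraction, read_len=150, seed=0):
    """Replace an exactly matching wild-type target sequence with a mutant sequence
    in roughly the requested fraction of matching reads, then report how many
    replacements were actually made. The achieved (absolute) artificial allele
    frequency is always at or below the target, because reads carrying random
    sequencing errors never match the wild-type string exactly."""
    rng = random.Random(seed)
    matching, replaced = 0, 0
    with open(fastq_in) as fin, open(fastq_out, "w") as fout:
        while True:
            header = fin.readline()
            if not header:
                break
            seq = fin.readline().rstrip("\n")
            plus = fin.readline()
            qual = fin.readline().rstrip("\n")
            if wt_seq in seq:                        # 100% match strategy
                matching += 1
                if rng.random() < target_fraction:   # stand-in for count-then-replace
                    seq = seq.replace(wt_seq, mut_seq, 1)[:read_len]  # trim insertions back to 150 bp
                    replaced += 1
            # Keep the quality string the same length as the (possibly edited) sequence.
            if len(qual) < len(seq):
                qual = qual + qual[-1] * (len(seq) - len(qual))
            fout.write(header)
            fout.write(seq + "\n")
            fout.write(plus)
            fout.write(qual[:len(seq)] + "\n")
    achieved = replaced / matching if matching else 0.0
    return matching, replaced, achieved
```

Because unique read names are preserved, the same bookkeeping can later be used to follow each spiked read through mapping, filtering, and variant calling.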
After each artificial set of fastq files was generated, the data were processed through either our standard analysis pipeline or through a modified version of the pipeline. This pipeline utilized the onboard analysis workflows on the MiSeq (Illumina), as well as different iterations of a custom analysis workflow. In addition to in-house scripts, the following software programs were used: BTrimV64 (11) to trim amplicon primer sequences; Novoalign V2.07.18 and V3.00.05 for read mapping (http://novocraft.com); SamTools V0.1.18 to sort, merge, and index bam files (12,13) (http://samtools.sourceforge.net/); Picard V1.46 to fix read groups (http://picard.sourceforge.net/); BedTools V2.12.0 to remove off-target reads (14) (https://code.google.com/p/bedtools/); VCFTools V0.1.9 to sort vcf files (15) (http://vcftools.sourceforge.net/); SnpEff V3.0 (16) (http://snpeff.sourceforge.net/) and Annovar (25th May 2012 version) for variant annotation (17) (http://www.openbioinformatics.org/annovar/); and GATK V1.6-13 for depth of coverage, read quality trimming, and variant calling (18,19) (http://www.broadinstitute.org/gatk/). Quality filtering of bam files before variant calling was performed to clip reads to an average base quality score of 22 and to remove reads with mapping qualities below 40 and alignment scores below 95. AbsoluteVar, a new variant caller, utilizes a base-counting strategy to determine the number of base counts for each position in the target region, and generates a variant call file for each position where a non-reference base is detected at a depth and/or allele frequency specified by the user.
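AbsoluteVar itself is not reproduced here; the following is a minimal sketch of the base-counting idea it describes, under the assumption that per-position base and quality observations have already been extracted from the filtered bam file (for example, with a pileup tool). The function name, data structures, and default thresholds are illustrative values taken from the text, not the tool's actual interface.

```python
from collections import Counter

def base_count_calls(pileup, reference, min_depth=100, min_af=0.04, min_base_qual=21):
    """pileup: dict mapping (chrom, pos) -> list of (base, base_quality) observations.
    reference: dict mapping (chrom, pos) -> reference base.
    Emits one record per non-reference allele meeting the depth and allele-frequency
    cutoffs, so a site carrying two non-reference alleles (e.g., a homozygous germline
    change plus a low-level somatic change) yields two calls."""
    calls = []
    for site, observations in sorted(pileup.items()):
        # Count only bases above the quality cutoff (>21 in the text).
        kept = [base for base, qual in observations if qual > min_base_qual]
        depth = len(kept)
        if depth < min_depth:
            continue
        counts = Counter(kept)
        ref_base = reference[site]
        for base, n in counts.items():
            af = n / depth
            if base != ref_base and af >= min_af:
                calls.append({"chrom": site[0], "pos": site[1], "ref": ref_base,
                              "alt": base, "depth": depth, "allele_fraction": round(af, 4)})
    return calls
```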
Results

The initial analytical pipeline was modeled after those commonly used in the NGS community. This algorithm begins with de-multiplexing of individual sample reads, followed by mapping of reads to the reference genome, filtering of off-target reads, data quality filtering, base pair depth of coverage calculations across every base in the target regions, and, finally, variant calling and annotation (Figure 1B). Artificial fastq files were generated, which contained a range of mutation types across the amplicon reads corresponding to our panel (Table 1, see Materials and methods), and processed through various iterations of the pipeline. Larger deletions were not tested since large deletions would influence amplicon-based capture due to the primer locations.

Limitations of read alignment/mapping

Our validation began by exploring the impact of various data alignment/mapping approaches and input parameters to understand the interplay between successful alignment and the sensitivity and specificity of mutation detection.

Figure 1  (A) Modularity of next generation sequencing assays: library preparation, sequencing, and bioinformatics. (B) Data processing pipeline highlighting the various steps through which data must proceed to generate a final report. Increased sensitivity for mutation detection was achieved through development of two new algorithms, AbsoluteVar and GarbagePicker, and by calling large indels on unfiltered data.

We compared two strategies commonly employed in NGS software solutions: mapping to the entire genome or mapping to a restricted genome representing the amplicon target regions and potential off-target homologous sites. Mapping reads to the amplicon target regions has been utilized by the onboard bioinformatics processing in the Illumina MiSeq as part of the amplicon workflow to reduce data processing times. This approach is limited by the need to know all potential off-target capture regions a priori. Comparison of these mapping strategies identified several false-positive and -negative variant calls resulting from the restricted alignment search space. When allowing read data to map to the entire genome with both Novoalign (V2.07.18) and Illumina (MiSeq Resequencing workflow), reads with a homologous sequence on chromosome 2 were properly placed onto chromosome 9. Corrected mapping of the off-target capture reads resulted in removal of false-positive variant reads on chromosome 9. False-positive variants, which were produced by mapping restricted to the amplicon panel, were identified in every sample and, therefore, could be filtered as artifacts; however, mismapping resulted in inaccurate read depths at these loci, which diluted the allele frequency of true variants. With variant calling dependent upon allele frequency, this dilution could result in false-negative calls for real mutations within the targeted amplicon.

Using the whole genome for read mapping, all artificial reads containing SNVs, insertions up to 34 bp, and deletions up to 30 bp were successfully mapped to the correct genomic locations using Novoalign (V2.07.18) as the aligner. Since the mapping methodology aligns a single read at a time, there was no impact from the artificial allele frequency on mapping. For those read pairs containing indels larger than ~30 bp, two trends were identified in the data based upon where the artificial indel resided. We considered these either "orphaned" or "lost" alignment reads. The "orphaned" pattern occurred when an artificial indel was found predominantly in one read for a given read pair: the indel-containing mate would fail to map, while its "normal" or mostly normal mate would map correctly. This would occur when an insertion was present near the ends of an amplicon and affected only one read of the read pair.

Table 1  Artificial variants introduced into the raw data

Artificial mutation type    Number of loci/types tested    Range of sizes (bp)    Allele frequency range for each change (%)
SNV                         25                             1                      1.6–50.1
Insertion                   49                             2–90                   0.3–69.0
Deletion                    40                             1–37                   0.5–44.0

Figure 2 Raw data visualized in the IGV, which highlights data patterns for a single amplicon. In normal samples sequenced by
paired-end reads, the distribution at the top of the window demonstrates longer bars with the overlapping segments between the
forward and reverse reads showing a higher depth of coverage. During data processing, reads containing an insertion (purple bar) in
only one read end up being filtered out, resulting in discordant read depths between the read pairs. (Color versions of these illustrations
are available on the journal’s website at www.cancergeneticsjournal.org.)

The "lost" pattern occurred when the artificial indel affected both reads, leading to a general lack of mapping for both pairs. Since the data utilized in this approach were generated from amplicons with the same starts and stops to every read, indels affecting only one mate, and therefore demonstrating discordant read counts over the amplicon, could be visually detected by reviewing the mapped data in the Integrative Genomics Viewer (IGV; http://www.broadinstitute.org/igv/) (Figure 2); however, indels affecting both reads resulted in a normal data pattern (due to the lost, unmappable reads) and, therefore, could not be visually detected. Since these reads would never map, this demonstrated a hard limit of indel variant detection related to the first step in the pipeline.

Review of the run data showed that each of the unmapped reads was present in the alignment file, but contained a flag value (http://samtools.sourceforge.net/SAMv1.pdf) describing it as unmapped (http://picard.sourceforge.net/explain-flags.html). To help identify amplicons with a large fraction of unmapped reads, we developed a tool called GarbagePicker. Since every amplicon contains reads with the same start and stop primer sequence, we were able to capture flag-value statistics (i.e., both mates mapped, one mate unmapped, both mates unmapped) for every amplicon read based upon searching for its primer sequences in the post-alignment data. With this tool, we were then able to identify amplicons where a given fraction of the sequencing reads were flagged as unmapped and manually investigate the sequence in the unmapped reads to identify the mutation. After a baseline was established for the number of reads that did not map for all amplicons in the panel, thresholds were established to flag only amplicons that showed an atypical mapping pattern (data not shown). With this approach, manual review of misbehaving amplicons detected the artificial mutations in every artificial data set.
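As an illustration of the flag-statistics idea behind GarbagePicker (the actual tool is not reproduced here), the sketch below tallies mapped/unmapped combinations per amplicon by matching each read's primer sequence and decoding the standard SAM flag bits (0x4 = read unmapped, 0x8 = mate unmapped). The input format (a text SAM stream), the primer dictionary, and the flagging threshold are assumptions made for the example.

```python
from collections import defaultdict

# SAM flag bits defined by the SAM specification.
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def amplicon_flag_stats(sam_lines, primers, unmapped_fraction_cutoff=0.05):
    """sam_lines: iterable of SAM record lines (header lines starting with '@' are skipped).
    primers: dict mapping amplicon name -> primer sequence expected at the read start.
    Returns per-amplicon flag counts plus the amplicons whose unmapped fraction is atypical."""
    stats = defaultdict(lambda: {"both_mapped": 0, "one_unmapped": 0, "both_unmapped": 0})
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag, seq = int(fields[1]), fields[9]
        for amplicon, primer in primers.items():
            if seq.startswith(primer):              # assign the read to its amplicon by primer match
                read_unmapped = bool(flag & READ_UNMAPPED)
                mate_unmapped = bool(flag & MATE_UNMAPPED)
                if read_unmapped and mate_unmapped:
                    stats[amplicon]["both_unmapped"] += 1
                elif read_unmapped or mate_unmapped:
                    stats[amplicon]["one_unmapped"] += 1
                else:
                    stats[amplicon]["both_mapped"] += 1
                break
    flagged = []
    for amplicon, c in stats.items():
        total = sum(c.values())
        unmapped = c["one_unmapped"] + c["both_unmapped"]
        if total and unmapped / total > unmapped_fraction_cutoff:
            flagged.append(amplicon)               # candidates for manual review of unmapped sequence
    return dict(stats), flagged
```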
After our initial pipeline validation, Novoalign was updated (V3.0) to include a wider dynamic range for mapping qualities. Using the same artificial variant files, the updated version of Novoalign was able to successfully map all artificial reads with an insertion size up to 90 bp. Deletions <20 bp were always mapped properly, while deletions >20 bp were often mapped as if the remaining bases after the deletion were inserted. This marked a huge improvement in the ability to detect large indels.

Figure 3  Mutation detection sensitivity for various-sized insertions comparing insertion detection on bam files before (circles) and after (squares) quality filtering: (A) <6 bp. (B) 7–16 bp. (C) 17–26 bp. (D) 21–90 bp. Insertions <6 bp are detected equally with the pre- and post-quality filtered data, whereas insertions >16 bp are never detected with post-quality filtering.

Quality filtering

Despite the ability to map indels as large as 90 bp, the data quality processing steps implemented after mapping resulted in the inability to consistently detect indels >6 bp without manual data inspection or the use of GarbagePicker (Figure 3). Since GATK requires an indel fraction and minimum indel count parameter, all artificial indels ≤6 bp were detected down to the specified allele frequency of 5%, with no false-positives detected. By tracing individual artificial variant reads using their unique read names (identifiers), we found that they were removed during the quality filter step used to clean up data that has either low mapping quality or low alignment scores (Figure 2). By altering the stringency of quality control filtering, we were able to obtain higher sensitivity in detecting larger indels; however, this decreased stringency had a negative impact on false-positive SNV detection (data not shown).

As an alternative to sacrificing SNV specificity for indel detection sensitivity, we altered our processing pipeline to call large (>6 bp) insertions on the pre-filtered data set and to call small insertions and SNVs after quality filtering. By incorporating this change (Figure 1B) into our processing pipeline, we were able to successfully identify insertions up to 90 bp and deletions up to 37 bp (the limit set by the mapping step) with no impact on specificity in the data analyzed (Figure 3). For lower limits of detection, 100% sensitivity was obtained down to a 7% allelic burden for pure (i.e., not complex) insertions and deletions. Complex rearrangements involving both deletion of the wild-type sequence and insertion of a foreign sequence at the same location were not tested. Despite the detection of deletions >20 bp, only those deletions <20 bp were properly identified by GATK as a deletion. The larger deletions were often mis-identified as insertions, with the undeleted portion of the read being "inserted" at the starting point for the deletion. In some cases, mutant-containing reads were mapped as both an insertion and a deletion, which split the total mutation allele frequency over two different variants. As a result, overall sensitivity was decreased in these cases, where a variant was only identified if either the deletion or the insertion was present at ≥4% allele frequency (Supplemental Figure 1).
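The split calling strategy described above can be summarized as a simple merge rule: trust large insertions/indels from calls made on the unfiltered alignments and trust SNVs and small indels from calls made on the quality-filtered alignments. The sketch below illustrates that rule only; the call-record fields and the 6 bp size boundary mirror the description in the text, while everything else (how the two call sets were produced, deduplication by site) is an assumption.

```python
def indel_size(call):
    """Size of the length change implied by a call record with 'ref' and 'alt' strings
    (0 for an SNV)."""
    return abs(len(call["alt"]) - len(call["ref"]))

def merge_call_sets(unfiltered_calls, filtered_calls, large_indel_cutoff=6):
    """Combine two call sets produced from the same sample:
    - large indels (> cutoff) are taken only from the unfiltered (pre-quality-filter) data,
      because the reads supporting them tend to be removed by quality filtering;
    - SNVs and small indels are taken only from the quality-filtered data,
      which keeps the false-positive rate for single nucleotide changes low."""
    merged = {}
    for call in filtered_calls:
        if indel_size(call) <= large_indel_cutoff:
            merged[(call["chrom"], call["pos"], call["alt"])] = call
    for call in unfiltered_calls:
        if indel_size(call) > large_indel_cutoff:
            merged.setdefault((call["chrom"], call["pos"], call["alt"]), call)
    return sorted(merged.values(), key=lambda c: (c["chrom"], c["pos"]))
```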
Single nucleotide variants

To investigate the interplay of the data processing steps with respect to SNV detection, 25 variants were introduced into the data sets at allelic ratios ranging from 3–50%. Using the standard informatics processing pipeline with GATK as the variant caller, we were unable to identify a single variant below 6% allele frequency (Figure 4A). Even at allele frequencies higher than this 6%, 100% sensitivity could not be achieved, despite manual confirmation of the data in the bam file used for variant calling. Using IGV to review the mapped data, we were able to manually adjust the allele frequency threshold for consensus variant visualization, which also confirmed the presence of each of these artificial mutations. False-positive calls were also made where the allele frequency was 100%, but the total read depth was only a few reads.

Figure 4  SNV detection comparing AbsoluteVar (circles) to GATK (triangles) at (A) sites with only two alleles present and (B) sites with two non-reference alleles present. Using AbsoluteVar, 100% sensitivity is achieved at allele frequencies >4%, compared to <100% sensitivity when using GATK.

A major deficit in GATK was the inability to set input parameters to control the minimum depth of coverage or an allelic fraction for variant calling. To address this issue, we developed a novel algorithm, AbsoluteVar (see Materials and methods), which provided the ability to define our own thresholds. Utilizing this tool in place of GATK, we analyzed the artificial SNV bam files with variant detection cutoff parameters specified at 4% allele frequency for any base and a minimum depth of 100 reads, counting only bases with a quality score >21. With this variant caller, we achieved 100% sensitivity and specificity down to a 5% allele burden (Figure 4A). As we approached the specified 4% cutoff, we saw a small decrease in sensitivity that resulted from the somewhat variable number of reads mapping from repeat analysis to analysis. Since the mapping step is based upon random processes of matching seed sequences to the reference genome, we saw small variations in the number of reads mapping in each repeat analysis; these resulted in artificial variant allelic burdens that ranged from 3.95% to slightly >4.0% and were therefore not called with the 4% hard cutoff. Pushing the variant calling threshold below 4% began resulting in false-positive variant calls, and an average error rate of 1% for SNVs was calculated across the panel.

Using GATK and AbsoluteVar, we also investigated the ability to detect a rare event where a normally polymorphic site in the genome contained a germline homozygous variant in addition to a somatic variant at a low allelic burden. This created an artificial situation where the data set contained two nucleotide changes at one location compared to the reference genome. For example, in one test case, the artificial data presented with a germline homozygous state of CC compared to the reference genome, GG, in addition to a somatic variant, T. Comparing GATK to AbsoluteVar, we were unable to detect the T variant call below the 7% allele frequency with GATK, but were able to identify the variant at all allelic burdens >4% using AbsoluteVar (Figure 4B).
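As a hypothetical worked illustration of this two-non-reference-allele case (the numbers below are invented for the example, not taken from the study data), the usage snippet feeds a single site carrying a germline homozygous C change plus a low-level somatic T into the base-counting sketch given after the Materials and methods; both non-reference alleles are reported because each is evaluated against the allele-fraction cutoff independently.

```python
# Hypothetical pileup at one site: reference G, germline homozygous C, ~5% somatic T.
pileup = {("chr1", 12345): [("C", 30)] * 930 + [("T", 30)] * 50 + [("G", 30)] * 20}
reference = {("chr1", 12345): "G"}

calls = base_count_calls(pileup, reference, min_depth=100, min_af=0.04, min_base_qual=21)
for call in calls:
    print(call["alt"], call["allele_fraction"])
# Expected output (order may vary): C 0.93 and T 0.05 -- both alleles meet the 4% cutoff,
# whereas a diploid genotyper that has committed to a C/C genotype can leave the T unreported.
```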

Artifacts due to variant position in amplicon reads

The read data used for validation were generated using our amplicon panels. As a result of using this technology, nearly every read for a given locus contained identical start and stop positions. Informatic variant callers identified the genomic coordinate and variant allele present, but did not specify where in an amplicon the variant was located.

During validation, we identified an indel at a position falling at the end of one amplicon and the middle of another. As a result of the position, this change was identified as both an SNV and an insertion (Figure 5). In the first data set, the analytical pipeline identified a missense change in TET2 resulting in a T>C. In this data set, the variant was present at the last base in all of the sequencing reads for the amplicon. In the overlapping amplicon, the data demonstrated a 2-bp insertion of CC at the same location, which resulted in a frameshift mutation. Both data sets were concordant for a C at the first variant call site; however, the absent sampling of the adjacent nucleotide with the first amplicon resulted in a false variant call.

Discussion

Next generation sequencing methods have the ability to detect a wide array of mutation types, which is mostly limited by the ability to properly process the raw data generated by the sequencer. Through the creation of artificial variant files with known mutation positions and allelic frequencies, we were able to directly follow next generation sequencing reads as they progressed through the bioinformatic analysis pipeline. This revealed the limitations of our initial pipeline, prompted changes in how large and small indels are identified, and illustrated the need for the development of additional tools to achieve higher mutation detection sensitivity.

Initial limitations of indel detection arose from the inability to map amplicon reads that contained these mutations. Large insertion- and deletion-containing reads were unable to be mapped with early versions of alignment software, while recent updates have overcome those issues by extending the range of mapping quality values assigned to each mapped read. The initial limit of ~35 bp was dramatically increased to ~90 bp with this update.

Figure 5 Raw data visualized in IGV for a variant captured by two different amplicons covering that locus. In the top section, the single
nucleotide change (T>C) is shown at the last base of the amplicon. A longer amplicon in the lower panel confirms the presence of a C at
that genomic position; however, it demonstrates that the mutation was not a single nucleotide event, but rather a two-base insertion.

This critical step presents the community with a "clinical" informatics dilemma, since the first step in processing the data takes raw sequences and maps them to a specific location in the genome; however, the premise is to find the mismatch in the patient's genome that explains the abnormal phenotype. It is therefore critical to understand the limitations of the alignment tool used, along with the specific parameters utilized with that tool.

This critical limitation is further highlighted in the pipeline data quality filtering steps, where medium and large insertion-containing reads are filtered out of the data set even after successful mapping. To overcome the limitations introduced by this filtering step, while preserving the crucial ability to filter out low-quality data and therefore detect low-level SNVs, we moved large indel detection to the pre-quality-filtered bam file. This approach dramatically improved indel detection without an increase in false-positives, since the common sequencing artifacts with the Illumina sequencing technology are single nucleotide changes. It is unclear how these parameters would need to be set for other technologies that have different common sequencing artifacts.

When a large insertion or deletion was found in both reads of a pair, neither read was able to map, which made detection nearly impossible.

The discrete nature of amplicon-generated data with identical ends makes it possible to "count" the number of reads captured for each amplicon in the unprocessed data. Utilizing this knowledge, we developed a tool called GarbagePicker, which determined the number of reads falling into both unmapped and mapped portions and allowed identification of amplicons that showed abnormal mapping patterns. Upon manual review of these "flagged" amplicons, we were able to identify very large insertions in all of the artificial data sets tested. This approach would not work with NGS data generated using a random shearing-based library prep, but it is very powerful for amplicon-based assays.

Single nucleotide variant detection also proved to be a challenge with our artificial data sets, since the major variant calling tools were designed to identify heterozygous or homozygous changes in low read-depth data sets (i.e., constitutional changes). The inability to reliably identify variants with low allele frequencies prompted us to develop a simple tool that utilized base counts at every position to generate a variant call file. With this approach, we created a program called AbsoluteVar, which enabled us to define allele frequency, allele depth, and allele quality score cutoffs to capture all genomic locations where a non-reference allele was present that met the specified criteria. This tool also allowed us to identify sites with multiple non-reference alleles present, even at low allele frequencies for those bases.

Validating the bioinformatics process of a clinical NGS test is something that can be achieved without a significant cost to the laboratory by using artificial data sets. It also allows the laboratory to test mutation detection across many different genomic loci, and with a range of mutation types that are impossible to obtain from biological samples. This approach may also be a useful mechanism of proficiency testing to ensure high-quality mutation detection capabilities and to define a laboratory's limits of detection due to the processing pipeline alone. Ultimate validation must still include analysis of real samples. For many labs, however, samples used for assay validation are only characterized by a handful of loci, making false-negatives in novel regions arising from inadequate data processing a real problem.

In conclusion, next generation sequencing is a powerful approach that is poised to alter the paradigms in the practice of all fields of medicine. The technologies are incredibly robust and continue to improve. Even in the presence of immature data processing pipelines, countless discoveries and an unprecedented number of diagnoses have been made. As the field of bioinformatics matures and begins producing clinical-grade solutions, it is critical to validate the informatics portion of an NGS assay to define the actual limits of variant detection as it relates to single nucleotide changes, insertions, and deletions before clinical implementation. In addition to initial validation of the informatics pipeline, swapping data sets among various laboratories can be a robust tool for quality assurance and proficiency testing.

Supplementary data

Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.cancergen.2013.11.005.

References

1. McCourt CM, McArt DG, Mills K, et al. Validation of next generation sequencing technologies in comparison to current diagnostic gold standards for BRAF, EGFR and KRAS mutational analysis. PLoS One 2013;8:e69604.
2. Ross JS, Ali SM, Wang K, et al. Comprehensive genomic profiling of epithelial ovarian cancer by next generation sequencing-based diagnostic assay reveals new routes to targeted therapies. Gynecol Oncol 2013;130:554–559.
3. Shanks ME, Downes SM, Copley RR, et al. Next-generation sequencing (NGS) as a diagnostic tool for retinal degeneration reveals a much higher detection rate in early-onset disease. Eur J Hum Genet 2013;21:274–280.
4. Weiss MM, Van der Zwaag B, Jongbloed JD, et al. Best practice guidelines for the use of next-generation sequencing applications in genome diagnostics: a national collaborative study of Dutch genome diagnostic laboratories. Hum Mutat 2013;34:1313–1321.
5. Wong LJ. Challenges of bringing next generation sequencing technologies to clinical molecular diagnostic laboratories. Neurotherapeutics 2013;10:262–272.
6. Fernald GH, Capriotti E, Daneshjou R, et al. Bioinformatics challenges for personalized medicine. Bioinformatics 2011;27:1741–1748.
7. Singh RR, Patel KP, Routbort MJ, et al. Clinical validation of a next-generation sequencing screen for mutational hotspots in 46 cancer-related genes. J Mol Diagn 2013;15:607–622.
8. Tsiatis AC, Norris-Kirby A, Rich RG, et al. Comparison of Sanger sequencing, pyrosequencing, and melting curve analysis for the detection of KRAS mutations: diagnostic and clinical implications. J Mol Diagn 2010;12:425–432.
9. Yeo ZX, Chan M, Yap YS, et al. Improving indel detection specificity of the Ion Torrent PGM benchtop sequencer. PLoS One 2012;7:e45798.
10. Grimm D, Hagmann J, Koenig D, et al. Accurate indel prediction using paired-end short reads. BMC Genomics 2013;14:132.
11. Kong Y. Btrim: a fast, lightweight adapter and quality trimming program for next-generation sequencing technologies. Genomics 2011;98:152–153.
12. Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009;25:2078–2079.
13. Ramirez-Gonzalez RH, Bonnal R, Caccamo M, et al. Bio-samtools: Ruby bindings for SAMtools, a library for accessing BAM files containing high-throughput sequence alignments. Source Code Biol Med 2012;7:6.
14. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 2010;26:841–842.
15. Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools. Bioinformatics 2011;27:2156–2158.
16. Cingolani P, Platts A, Wang le L, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 2012;6:80–92.
17. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010;38:e164.
18. McKenna A, Hanna M, Banks E, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20:1297–1303.
19. DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43:491–498.
