Exome Sequence Analysis and Interpretation

Handbook for Clinicians
1st Edition
________
Vinod Scaria
Sridhar Sivasubbu
Scaria and Sivasubbu (2015). Exome Sequence Analysis and Interpretation
1st Edition (2015)

Version 1.01
Scaria V and Sivasubbu S
Exome sequence analysis and interpretation

The entire surplus from the sale of this book will go to support
advancing research in genomics.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Cover Image: Artist's impression of nucleotides in a DNA strand.
Oil on canvas by Pradha (2015)
Acknowledgements
A number of individuals have contributed to this book
in personal as well as professional capacities. These include
graduate students from our groups, especially Mr.
Shamsudheen Karuthedath Vellarikkal, Mr. Rijith Jayarajan,
Mr. Ankit Verma, Ms. Saakshi Jalali, Ms. Heena Dhiman and
Mr. Kandarp Joshi, who helped in collating the content and
figures that enrich the manuscript. The authors also thank and
acknowledge the critical comments, editorial help and support
of our colleagues, Dr. Vamsi Krishna, Dr. Adita Joshi, Dr.
Srinivasan Ramachandran, Dr. Jameel Ahmad Khan and Dr.
Abhay Sharma.
The authors thank the Genomics for Understanding Rare
Diseases- India Alliance network (GUaRDiAN) and its
collaborators for critical insights, which significantly enriched
the outlook and content of this book. The authors also thank the
innumerable patients and families who have
interacted with us through the network, without whom our
insights and knowledge would have been limited.
The authors acknowledge financial support from
the Council of Scientific and Industrial Research (CSIR), India
through grant BSC0212 (Wellness Genomics Project). The
funding agency had no role in the preparation of the
content or the decision to publish this book. The authors
declare no competing financial interests.
Dedication
Dedicated to the innumerable patients and families who
enriched our knowledge and insight through their close
interactions, shared their distress like a family member,
contributed samples to research selflessly, without which we
would not have been what we are, and we would not be doing
what we do, and would not be writing what we wrote.
Contents
Foreword
Case of the Bhai
The human genome project and how it changed everything
Genome variations and how they make us different?
A brief introduction to next generation sequencing
When you could sequence your own genomes
So what if we could sequence just the protein coding genome?
When should you do exome sequencing?
When should you probably not do exome sequencing?
First things first: putting insights before data
Educating the patient and getting an informed consent
Points to note when you outsource exome sequencing
Understanding the steps in analysis of exome sequence data
How good is the exome sequencing data?
Prioritizing, annotating and interpreting variants
Don't forget the validation
Ethical considerations in whole exome sequencing
The last word
Index
Foreword
I would easily pick Next Generation Sequencing as one of
the techniques that has had an immediate and immense application
in research and healthcare. Within a span of five years, exome
sequencing has become something that hardly any scientist or
physician can afford to be ignorant of. With newspapers and the
internet screaming genome everywhere, this handbook by Dr. Scaria
and Dr. Sivasubbu is timely.
The introductory chapter on Bhai is a story of exome
sequencing told lucidly enough even for the general public. It is
important for everybody, not just clinicians, to know what
sequencing is and how the Human Genome Project has improved
our understanding of the role of genetic variants in health and
disease. The authors then introduce readers to the exome, the
clinical importance of sequencing it and the situations where this is
helpful in patient care. At the same time, they warn
physicians not to get carried away. In the next chapter they
explain the basics of medical evaluation and how these remain
evergreen even in the current era.
It is important the patient is not taken for a ride by the new
diagnostic companies which did not exist the previous year.
Both clinician and the patient must be aware of what they are
doing with the new test and what they can expect in the form
of results. Probably both need to be involved thoroughly in the
consenting process.
For a researcher, the authors explain how outsourcing is not
easy despite the availability of several service providers, and
detail in simple terms how the large volume of data can be analyzed.
The chapters on quality control and interpretation of variants help
readers understand the intricacies of this technique. Independent
validation of the results is vital for applying this technique in
clinical practice, especially prenatal diagnosis. To conclude, the
authors elegantly touch upon the ethical issues that cry for attention.
The "Did you know?" text boxes spread throughout the book
highlight genetic milestones or common terms that
should be general knowledge.
With their excellent medical and scientific background, and
having pioneered this technique in our country both scientifically and
socially, Dr. Scaria and Dr. Sivasubbu have done an incredible job of
cracking the hard nut of exome sequencing. The book is a
must-read for all clinicians and students of genetics.
Girisha KM
Professor and Head
Department of Medical Genetics
Kasturba Medical College, Manipal
Manipal University
Chapter 1
Case of the Bhai

The day after the Indian genome sequence was
announced[1], we received a phone call from an individual
who introduced himself as Bhai[2]. A phone call from a Bhai
comes with a lot of connotations, the popular
imagination being that of the underworld calling for
extortion. Fortunately for us, Bhai was neither part of the
underworld nor interested in extorting money;
he must have been aware that we were not
millionaires to extort money from.
Bhai nevertheless had a bigger problem at hand.
He said that he had a skin problem and wanted us to talk
to his doctor. On closer discussion, it became evident that he
suffered from an inherited genetic disease, which had
affected multiple members of his family. Days later, we
learned from his physician that his family
suffered from a rare genetic skin disease called
Epidermolysis Bullosa (EB). EB encompasses distinct
disease subtypes with variable severity, ranging from
localized lesions to a more extensive or generalized
form. The disease is caused by defects in a number of
genes, mostly involved in maintaining the integrity of the
skin layers. Mutations in any one of these genes result in
fragility of the skin, causing eruptions that
resemble those that occur after burns. These
eruptions, or 'bullae', sometimes break open, get
infected and result in scarring and sometimes extensive
pigmentation.

[1] The sequencing of the first genome of an Indian was announced on
8th December 2009. Source:
http://www.pib.nic.in/newsite/erelease.aspx?relid=55470
Also published in: Patowary, Ashok, et al. "Systematic analysis and
functional annotation of variations in the genome of an Indian
individual." Human Mutation 33.7 (2012): 1133-1140.
[2] Bhai in Hindi and Gujarati means brother. It is a popular surname
attached to most Gujarati names. In colloquial terms, it is also
sometimes used for an underworld don.
Bhai wanted the genetic lesion to be identified.
This was a complicated task to begin with. We had two
options. The first was to systematically characterize the
mutation by sequencing every single exon, one by one,
using conventional sequencing approaches, which
would have cost us a lot in terms of time, money and
effort. The second was to use a genome-scale approach, without
a prior hypothesis, to sequence multiple genes in one go and
then try to mine the variation from the haystack. A paradigm
shift in the approach was on the anvil. We had worked
extensively on setting up sequencing on a new technology that
allowed us to sequence whole genomes, or specifically the parts
of the genome that consist of protein-coding genes[3]. We had
also gained experience in systematically analyzing genome data
for variants.

Did you know?
Epidermolysis bullosa is a rare genetic disease of the
skin presenting with blisters on the skin. The disease runs
in families and has an incidence of approximately 1
in 50,000 individuals.

[3] One technology to sequence the part of the genome that encodes
proteins is called exome sequencing. The concept of this methodology
forms the basis of this book and is detailed in the later chapters.
There was another technical issue. A close study
of the pedigree revealed that almost half of the family
members, spanning three generations, were affected with the
disease, suggesting autosomal dominant inheritance. That
would essentially mean that the variant in question would
potentially be heterozygous[4]. Heterozygous variations
can be difficult to identify. On one hand, you
require enough coverage[5] to accurately call a
heterozygous variation. On the other hand,
differentiating a potentially causative variation from a
number of other changes is a tedious and challenging
task.

What is a pedigree chart?
A pedigree chart is a graphical document that details the
ancestry of an individual. It is a very important tool to study
the inheritance of diseases in a family over generations.
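
The link between coverage and the ability to call a heterozygous variation can be illustrated with a small calculation. The sketch below is only an illustration under stated assumptions: each read covering a heterozygous site is assumed to carry the variant allele with probability 0.5, and a caller is assumed to need at least three variant reads. The function name `prob_detect_het` and both default values are hypothetical and are not taken from any particular variant caller.

```python
from math import comb


def prob_detect_het(depth: int, min_alt_reads: int = 3, alt_fraction: float = 0.5) -> float:
    """Probability of seeing at least `min_alt_reads` variant reads at a
    heterozygous site covered by `depth` reads (simple binomial model)."""
    if min_alt_reads > depth:
        return 0.0
    # Probability of observing too few variant reads to make the call.
    p_too_few = sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_too_few


# Assumed example depths; real projects choose these per assay.
for depth in (5, 10, 20, 30):
    print(f"{depth:2d}x coverage: P(detect) = {prob_detect_het(depth):.3f}")
```

Under this simplified model the chance of missing a heterozygous variation falls quickly as coverage increases, which is why low-coverage regions deserve caution when a dominant disease is being investigated.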
[4] The human genome is diploid. That means we have two copies of
each chromosome, and therefore two nucleotides correspond to each
position in the genome. If the two nucleotides are not the same,
only one copy has a variation. Such a variation is called
heterozygous.
[5] Coverage here denotes the number of times a position in the
genome has been read, or covered, by the sequence data.

There were also well-established workarounds for
these problems. One approach was to sequence two
affected individuals, see which variations are shared
between the two datasets, and prioritize the shared
variations that could potentially change amino acids or
functional regions. The other approach was to
sequence one affected individual, use computational
approaches to prioritize variations that change amino
acid sequences, and then check whether these
variations are present in other family members,
both affected and unaffected.
We decided to pursue the first path. We sequenced
the protein-coding regions of the genes (exomes) in two
affected individuals from two generations.
Systematic overlap of the single nucleotide
changes, followed by filtering for alterations that could
potentially have caused the disease, identified a variation in the
KRT5 gene[6]. Fortunately, KRT5 had previously been
associated with the disease. The variation was further
investigated in a number of affected and unaffected
individuals using conventional Sanger sequencing[7] of the
region around the variation. Interestingly, the same
variation was present in all affected individuals but
absent in all unaffected individuals tested, supporting
our conclusion that the variant is
causative of the disease in the family.

Did you know?
Keratin 5 (KRT5) is a cytoskeletal protein
important for the integrity of skin. Mutations in the KRT5 gene
can cause Epidermolysis Bullosa Simplex.

[6] Vellarikkal, Shamsudheen K., et al. "Exome sequencing reveals a
novel mutation, p.L325H, in the KRT5 gene associated with autosomal
dominant Epidermolysis Bullosa Simplex Koebner type in a large family
from western India." Human Genome Variation 1 (2014).
[7] Sanger sequencing is a molecular technology for sequencing nucleic
acids, developed by and named after Fred Sanger. The conceptual
methodology is detailed in the next chapter.
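
The strategy described in this chapter, overlapping the variations of two affected individuals, keeping those likely to alter a protein, and then checking segregation in the family, can be sketched in a few lines. This is a minimal illustration and not the pipeline used in the study: the `Variant` fields, the list of protein-altering effects, the coordinates and the family member labels are all made-up placeholders.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    effect: str  # e.g. "missense", "synonymous", "intronic"


# Effects assumed here to alter the protein; the list is illustrative only.
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift", "splice_site"}


def shared_candidates(exome_a: set[Variant], exome_b: set[Variant]) -> set[Variant]:
    """Variants seen in both affected individuals that may alter a protein."""
    return {v for v in exome_a & exome_b if v.effect in PROTEIN_ALTERING}


def segregates(carriers: set[str], affected: set[str], unaffected: set[str]) -> bool:
    """True if every affected member carries the variant and no unaffected one does."""
    return affected.issubset(carriers) and unaffected.isdisjoint(carriers)


# Made-up toy data: coordinates and member labels are placeholders.
v1 = Variant("chr12", 1000, "T", "A", "missense")
v2 = Variant("chr1", 2000, "G", "A", "synonymous")
v3 = Variant("chr7", 3000, "C", "T", "missense")

patient_1 = {v1, v2, v3}
patient_2 = {v1, v2}

candidates = shared_candidates(patient_1, patient_2)
print(candidates)

# Carriers as they might be found by Sanger sequencing of the candidate site.
carriers = {"I-1", "II-2", "III-1"}
print(segregates(carriers, affected={"I-1", "II-2", "III-1"}, unaffected={"II-3", "III-2"}))
```

Set intersection keeps the example short; in practice variants would be read from annotated variant files and matched on genomic coordinates.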
Chapter 2
The human genome project and how it changed everything
The quest to understand the sequence of DNA
was pioneered by Frederick Sanger, who received
the Nobel Prize in 1980 for the technique to determine
it. This technique, popularly known as Sanger
chemistry, is practiced to date and is based on
the concept that modified nucleotide bases
irreversibly terminate a DNA synthesis reaction
wherever they get incorporated. The
principle is simple. One clonally amplifies
short stretches of DNA and uses the single
strands as templates for DNA synthesis.
Apart from normal nucleotides, the synthesis mixture
is spiked with abnormal nucleotides,
which are modified and labeled. These
modified nucleotides, called di-deoxy
nucleotides, terminate the synthesis reaction
wherever they get incorporated by virtue of the
complementary sequence in the template strand. This
chain termination produces truncated products of
different sizes. Successive products differ in length
by one nucleotide and can be separated using gel
electrophoresis. Earlier, radioactively labeled bases were
used, which enabled detection by autoradiography, but
later non-radioactive modifications were developed that
allowed bases to be labeled with specific
fluorophores or light-emitting molecules. An overview
of the technology is given in Figure 1. The
methodology was perfected in the 1970s, and it was not until
a decade later that the technology matured and was
automated, fuelling the quest to sequence genomes.

Did you know?
Frederick Sanger (1918-2013) received two Nobel Prizes in
Chemistry, one in 1958 for determining the amino acid
sequence of insulin, and the second in 1980 for the
sequencing technology that eventually was
named after him.
The Sanger Centre, now the Sanger Institute at Hinxton,
which took a lead role in the International Human Genome
Project, was named in his honour. The Institute is now
one of the world's largest genome centers.
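
The chain-termination principle can be mimicked with a toy simulation. The sketch below is deliberately simplified and rests on assumptions of its own: it produces exactly one terminated product per position, treats the last base of each product as the labeled di-deoxy nucleotide, and ignores strand orientation and signal noise. The function names and the example template are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def terminated_products(template: str) -> list[str]:
    """All truncated copies of the template, one ending at each position.
    The last base of each product stands for the labeled di-deoxy nucleotide."""
    copy = "".join(COMPLEMENT[base] for base in template)
    return [copy[:length] for length in range(1, len(copy) + 1)]


def read_sequence(products: list[str]) -> str:
    """Mimic electrophoresis: order products by size, then read the labeled
    terminal base of each product from shortest to longest."""
    by_size = sorted(products, key=len)
    return "".join(product[-1] for product in by_size)


template = "TACGGATC"  # arbitrary example template
mixture = terminated_products(template)
random.Random(0).shuffle(mixture)  # the reaction tube holds products in no particular order

print(read_sequence(mixture))
```

Sorting by length plays the role of the gel or capillary, and reading the terminal label of each band recovers the synthesized strand one base at a time.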
The Sanger sequencing technology saw a number
of improvements. The major improvement was the
automation and miniaturization of the technique, which
saw the birth of automated capillary sequencers. In
capillary sequencers, electrophoresis happens inside
capillaries and the electrophoresis bands are detected
using lasers. The automation significantly increased the
throughput of the Sanger technology, enabling the sequencing
of larger genomes. Automated Sanger sequencing is popularly
dubbed first generation sequencing.
Figure 1. Conceptual overview of the Sanger sequencing
technology. The technology relies on the incorporation of labeled
di-deoxy nucleotides and chain termination.
Figure 2. One of the earliest sequencers that used the Sanger
sequencing methodology. The readout was obtained from
vertical gel electrophoresis. Courtesy: The genomics museum at
CSIR-Institute of Genomics and Integrative Biology, Delhi, India
Figure 3. Automated capillary sequencer. Courtesy: The genomics
museum at CSIR-Institute of Genomics and Integrative Biology,
Delhi, India
The planning for the ambitious Human Genome Project
started as early as 1984, but the project was formally
initiated, with appropriate funding, in 1990. The United
States Department of Energy (DOE) and the
National Institutes of Health (NIH) jointly
funded the project. The project was planned to be
completed in 15 years with a total outlay of
approximately 3 billion US Dollars. Apart from the
United States of America, the project also encompassed an
international consortium, which included researchers from
other countries including the United Kingdom, France,
Germany, Japan and China.

Did you know?
Apart from the NIH-led Human Genome initiative, a
parallel effort to sequence the human genome was
initiated by a private company, Celera Genomics,
led by Craig Venter. This was initiated in 1998 and was
estimated to cost approximately 300 million US
Dollars, far cheaper than the NIH-led effort. The draft
assembly was released and published in the year 2001.

The sequencing of the human genome involved
quite a cumbersome procedure. Initially, the genome of
3.3 billion bases was broken down into fragments
of approximately 150,000 bases each, which were cloned
into bacterial vectors. These were maintained
and replicated by the bacterial machinery for DNA
replication. Each of these clones was then sequenced
and assembled independently, before the pieces were put
together to assemble the chromosomes. This
methodology came to be known as the hierarchical
shotgun approach.
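
The scale of this clone-by-clone strategy follows from the two figures quoted above. The short calculation below is a back-of-the-envelope sketch: the genome and clone sizes come from the text, while the function name `clones_needed` and the 1.5-fold redundancy in the second call are assumptions made only for illustration.

```python
import math

GENOME_SIZE = 3_300_000_000  # bases, as quoted in the text
CLONE_SIZE = 150_000         # bases per cloned fragment, as quoted in the text


def clones_needed(genome_size: int, clone_size: int, redundancy: float = 1.0) -> int:
    """Number of clones needed to cover the genome `redundancy` times over."""
    return math.ceil(genome_size * redundancy / clone_size)


minimal_tiling = clones_needed(GENOME_SIZE, CLONE_SIZE)         # end to end, no overlap
with_overlap = clones_needed(GENOME_SIZE, CLONE_SIZE, 1.5)      # hypothetical 1.5-fold redundancy

print(minimal_tiling, with_overlap)
```

Even a perfect end-to-end tiling needs tens of thousands of clones, each of which had to be sequenced and assembled on its own, which gives a sense of why the procedure was cumbersome.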
Meanwhile, almost halfway through the publicly
funded Human Genome Project, a company called Celera
Genomics was formed in the year 1998. The
company used a radically different
approach that involved sequencing both ends
of short DNA fragments in a paired-end manner, which had
previously been used successfully to sequence small bacterial
genomes. The company promised to complete the genome
sequence at a much smaller cost of approximately 300
million US dollars, and to compete with the international
consortium.

Did you know?
The draft human genome was jointly announced by Bill
Clinton, President of the United States of America, and
Tony Blair, the British Prime Minister, on 26th of June
2000. The complete assembly of the genome was later
announced on April 14th, 2003.
The first chromosome to be sequenced was
chromosome 22, one of the smallest
chromosomes in the human genome. The
chromosome sequence was published in the year
1999.

In June 2000, the draft human genome was
announced by the then US President Bill
Clinton jointly with the British Prime
Minister Tony Blair. The papers
corresponding to the publicly funded genome and the
Celera assembly were published in the journals Nature
and Science respectively. Further improvements of the
drafts were announced in the year 2003.

Did you know?
Another noteworthy event that happened in this
period was the Bermuda declaration of 1996, also
known as the Bermuda principles for early access to
DNA information. The declaration set rules for the early
public release of data generated by the International Human
Genome Project. This was a significant
shift from the well-established practice of releasing
data only after publication in a peer-reviewed journal. This
declaration formed the basis of the pre-publication release of
genomic data, which is widely practiced even today.
The Human Genome Project was unique in many
ways. It was a mega-project that involved a
large number of researchers, not only from the United
States of America, which led the project, but also from
other countries across the globe, mainly Britain,
France, China and Japan. The major aim of the project
was to provide researchers with a working template of
the human genome and with tools and
resources to start understanding the basis of genetic
diseases in humans. The computational tools and
methods developed as part of the Human Genome
Project also significantly helped in the completion of the
genomes of other organisms, including many model
organisms like mouse, rat, zebrafish, worm and fly,
which have been extensively used to understand human
diseases.
Chapter 3
Genome variations and how they make us different?
The completion of human genome sequencing
led to two parallel large endeavors to understand the
human genome. One effort spearheaded the functional
characterization of the genome in terms of identifying
transcribed[8] and regulatory elements[9], whereas the second
initiative focused on understanding genomic
variability.
The human genome is quite large, over three
billion letters, drawn from four nucleotides,
Adenine (A), Thymine (T), Guanine (G) and Cytosine (C),
placed on a string. Though the genome is quite similar
between individuals, every one of us has changes, and
this variability in the human genome sequence is what
largely makes us different. The number of variations
between individuals is quite large, approximately
3,000,000 or 3 million. Given the large size of the human
genome, this is approximately one variation in every
thousand bases. Many of these variations do not have
any impact on the functioning of the organism. Some of
them are quite common in the human population[10]. We
should also note that half of the genome is inherited
from each parent[11]. The inherited variations
collectively make us look, behave and sometimes act like
our parents. Therefore, many of the variations could be
surrogates of features that we inherit. Geneticists call
these features traits[12]. As you would have guessed,
there are innumerable human traits, and
cataloging them is often a complex and tough
task.

[8] Protein coding genes are transcribed to messenger RNA and further
translated to proteins.
[9] Regulatory elements include regions in the genome that regulate the
expression of genes. These include promoters and
enhancers, among others.
Understanding genomic variations and their
association with human traits is by itself
quite complicated. On one hand, we need to
know the extent of genomic variability,
whereas on the other hand, we need
to know which variation or set of variations is associated with a
particular trait. Sequencing a large number of individuals
to understand the genomic variability would be a
herculean task due to the costs involved and the complexity
of executing such a large project. But without a grasp of
the genomic variability and an understanding of how it
could affect human traits, the fruits
of genomics cannot be tasted.

Did you know?
The Celera project included DNA from 5 donors selected
from a pool of 21 individuals. The founder, Craig Venter,
was also part of the pool.

[10] Variations that are common in the population, i.e., have a
frequency of more than 1%, are popularly called polymorphisms. Single
nucleotide variations that are polymorphic are therefore called
single nucleotide polymorphisms or SNPs.
[11] The human genome is diploid and one copy of each chromosome is
inherited from each parent.
[12] A trait is defined as a quality or feature, especially of an
individual. This could be, for example, hair color, color of the eye,
height etc.
Now there were shortcuts available. The first
shortcut was that one could create a crude map of
genomic variations by putting together information
from multiple sources. The first source that scientists
had at hand was the sequence data itself.
The genome assembly led by Craig Venter, popularly called
the Celera assembly, was one large resource. Apart from
that, scientists had also deposited sequences of
smaller regions, sometimes genes and parts of genes, in
the public domain, and this created the next resource.
So there was something to start with.
The genome is not randomly inherited from a
parent to the child. Genes are inherited as blocks of the
genome, one from each parent. Hence, the variations
too are inherited in blocks. So if one could study
common variations inherited in blocks, one could
identify the blocks that are associated with a trait, and
thus map the trait to the genomic region
encompassing the block. If one had a family in which
a particular trait is inherited, say lack of the pigment
melanin in skin, hair and eyes (a condition
called Albinism), one could theoretically study the blocks
of the genome inherited from each parent by each child and
observe whether the people who had Albinism all
inherited the same block of the genome. This is a
somewhat complex approach, which geneticists call
linkage mapping. Since children inherit a large number of
traits from their parents, differentiating one from
another becomes a humongous task. However, the task
becomes easier if one is lucky enough to identify large
families with numerous affected individuals spanning
multiple generations.
As we mentioned above, one could study just the
common variations and the blocks of the genome that
harbor them. These are called tag variations[13]. One
could then study a small number of common
variations to understand associations with common diseases.
Before single nucleotide changes were employed,
scientists used something simpler to
tag genomic blocks. These were based on
typing repeats in the
genome. The locations of many of these repeats were
common in the population, and one could use simple
techniques such as polymerase chain reaction (PCR) to
type these repeats and their lengths.

Did you know?
Polymerase chain reaction is a molecular technique
developed in 1983 by Kary Mullis to amplify a piece of
DNA. This technique won him the Nobel Prize for
Chemistry in 1993.

[13] Tag single nucleotide polymorphisms (SNPs) are representative
variations which mark a stretch of the genome.
The technological advancements in optics and the
miniaturization of components, including
microprocessors and microelectronics, that followed the
genome era also saw many of these being applied to
genomics. The earliest development was the
creation of microarrays, which in
the last decade revolutionized genomics. Scientists
learned that one could immobilize small
fragments of DNA onto glass slides[14]. These small fragments
of DNA could be used to identify single
nucleotide variations, by the mere fact that a fragment
hybridizes effectively only if the complementary nucleotide is
present. This became a quick
and popular assay for typing variations in the genome.
Further advancements in miniaturization allowed such
fragments to be packed at higher densities onto slides,
thereby increasing the number of variations that
could be typed.

Did you know?
The DNA samples in the HapMap project came from
individuals of the Yoruba tribe in Ibadan, Nigeria, Chinese
from Beijing, Japanese from Tokyo and people with
European ancestry whose samples are maintained at the Centre
d'Etude du Polymorphisme Humain (CEPH) in France.
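
The hybridization idea behind such arrays can be shown with an idealized sketch. It assumes that a probe binds only a perfectly complementary target, which real arrays only approximate through signal intensities, and it uses two made-up allele sequences that differ by a single nucleotide. All names in the code are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hybridizes(probe: str, target: str) -> bool:
    """Idealized hybridization: the probe binds only a perfectly complementary target."""
    return probe == reverse_complement(target)


def genotype(sample_alleles: list[str], ref_probe: str, alt_probe: str) -> str:
    """Call a genotype from which of the two allele-specific probes give a signal."""
    ref_signal = any(hybridizes(ref_probe, allele) for allele in sample_alleles)
    alt_signal = any(hybridizes(alt_probe, allele) for allele in sample_alleles)
    if ref_signal and alt_signal:
        return "heterozygous"
    if alt_signal:
        return "homozygous variant"
    if ref_signal:
        return "homozygous reference"
    return "no call"


ref_allele = "ACGTTAGC"  # made-up sequence around the site
alt_allele = "ACGTCAGC"  # same stretch with a single nucleotide change
ref_probe = reverse_complement(ref_allele)
alt_probe = reverse_complement(alt_allele)

print(genotype([ref_allele, alt_allele], ref_probe, alt_probe))
```

Each position on a slide would carry one such probe, so packing more probes onto a slide directly increases the number of variations typed in a single experiment.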
The ready availability of microarrays to study
variations provided a huge impetus towards the
understanding of genomic variations and their associations
with human traits. These studies extensively used
genome-wide approaches to mark blocks in the genome
and are popularly known today as genome-wide
association studies (GWAS). The later years saw the
discovery of a large number of variations and their
associations with human traits and diseases. This
approach continues to yield fruitful dividends in
cataloging genomic variations and their associations with
traits and diseases.

[14] Such slides are popularly known as microarrays.
A number of global initiatives to map genomic
blocks and their associations have provided us with a
map of regions in the human genome associated with
distinct human traits and diseases in various
populations[15]. These efforts were notably the first
popular approaches to collect genomic variations
associated with human diseases.
Now coming back to the case of Bhai. While
genome-wide association studies were moderately
effective in mapping genomic blocks associated with
common diseases and traits, these approaches were
futile in the case of rare genetic diseases. This was
primarily because genome-wide association studies
relied on common variants and common traits, whereas
rare genetic diseases are caused by rare variants. In the
earlier sections, we had mentioned an approach using
repeats, called microsatellites[16]. Microsatellite-based
studies were the mainstay in mapping genes associated
with such rare diseases, but were often cumbersome,
time-consuming and costly, and their success was heavily
dependent on identifying large families. A typical
microsatellite study in a standard molecular biology
laboratory would take months for data generation and
analysis, which, together with the need for expertise and
infrastructure, precluded its widespread application in
clinical settings.

[15] Welter, Danielle, et al. "The NHGRI GWAS Catalog, a curated resource
of SNP-trait associations." Nucleic Acids Research 42.D1 (2014): D1001-D1006.
A visual representation of this map is available at URL:
http://www.genome.gov/gwastudies/
[16] Microsatellites are also called Simple Sequence Repeats (SSRs) or
Short Tandem Repeats (STRs). They encompass small motifs of 2-5
nucleotides which occur in tandem.
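
Typing a microsatellite amounts to finding a short motif repeated in tandem and counting its copies. The sketch below uses a regular expression for this; the minimum of three copies, the function name and the example sequence are arbitrary choices for the illustration, and real repeat finders handle imperfect repeats that this pattern ignores.

```python
import re

# A motif of 2-5 bases followed by at least two further copies of itself.
# The minimum of three copies in total is an arbitrary choice for this sketch.
SSR_PATTERN = re.compile(r"([ACGT]{2,5}?)\1{2,}")


def find_microsatellites(sequence: str) -> list[tuple[int, str, int]]:
    """Return (start, motif, copies) for each perfect tandem repeat found."""
    hits = []
    for match in SSR_PATTERN.finditer(sequence):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


# Made-up sequence with a (CA)5 and a (GATA)3 repeat.
example = "GG" + "CA" * 5 + "TT" + "GATA" * 3 + "CC"
hits = find_microsatellites(example)
print(hits)
```

Two individuals who carry different numbers of copies at the same location yield PCR products of different lengths, which is what made these repeats usable as markers for genomic blocks.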
Chapter 4
A brief introduction to next generation sequencing
After the announcement of the human genome
sequence, a silent revolution was taking shape on the
technology front. A number of researchers were working
hard to enable quick and cheap sequencing of
nucleotides. Traditional Sanger sequencing lacked
the speed and cost-effectiveness required to sequence
genomes at scale. A number of research labs around the globe
were approaching the problem in a variety of ways. The
field also saw the convergence of technologies from
multiple areas, including nanotechnology,
microelectronics and computing. These efforts led to the
emergence of a spectrum of approaches, each different
in principle and with its own set of limitations and
advantages, but similar in the goal of providing cheap,
fast and high-throughput sequencing of nucleotides.
These technologies came to be popularly known as
next generation sequencing (NGS), differentiating them
from the first generation sequencing technology, which
comprised automated Sanger chemistry.
Briefly, next generation sequencing refers to a
gamut of sequencing technologies which differentiate
themselves from conventional Sanger sequencing in
terms of the technology employed, a significantly higher
throughput of sequence generation, the quality of the
sequencing and a reduction in per-base sequencing costs.
One of the earliest NGS technologies used was

called massively parallel signature sequencing or MPSS,
developed by a company called Lynx Therapeutics as
early as in the year 2000. The MPSS technology is not in
commercial use anymore and is rather of historical
importance. One of the first commercial offerings in the
NGS space came from 454 life sciences. The commercial
454 sequencers were
Did you know?
launched in the year
2004. These systems
The
pyrosequencing
used pyrosequencing
methodology relied on the
approach to sequence
release of a pyrophosphate
nucleotides.
Short
with nucleotide addition. This
pyrophosphate is acted upon
fragments
of
by ATP sulfurylase and
nucleotides
were
produces
ATP
in
the
captured on beads and
presence
adenosine
5
clonally amplified in an
phosphosulfate. This ATP
emulsion covering the
reacts with Luciferin to
beads. The beads were
produce oxy-luciferin and
further deposited onto
generates light, which is
microtitre plates. The
captured by the camera.
bases were reversibly
added, which on each
cycle would release a
pyrophosphate that was detected by imaging the cell on
the microtitre plate, thus enabling scalability to
sequence millions of short stretches of nucleotides. The
sequencing technology became quite popular due to the
longer read lengths and high quality data. The 454
sequencing technology was eventually acquired and
marketed by Roche Diagnostics. Other two technologies
that came to the commercial space were the SOLiD
technology marketed by Life Technologies and the
reversible termination sequencing technology
developed by Solexa and later acquired and improved
upon by Illumina in the year 2007. The SOLiD technology,
which stands for sequencing by oligonucleotide ligation
and detection, employed amplification of short stretches
of DNA using emulsion PCR and ligation-based chemistry
to sequence short stretches of DNA. The first
commercial SOLiD sequencers were launched in the year
2007. Though methods like massively
parallel signature sequencing and colony sequencing
were historically the forerunners of modern and more
popular NGS approaches, many of these technologies are
no longer in vogue and are primarily of historical interest
or have very specialized applications. Nevertheless, the
associated tools and methods, including miniaturization,
massive parallelization and methods for assembling short
sequences, still form the conceptual mainstay of the field.
These methodologies are detailed in a later section of
this book.
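The pyrosequencing readout described earlier in this chapter can be illustrated with a minimal sketch. It assumes a fixed, repeating order of nucleotide flows and light signals that have already been normalized so that one incorporated base gives a signal of about 1.0. The function name `decode_flowgram`, the flow order and the example signals are hypothetical and are not taken from any vendor software.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Convert normalized light signals into a base sequence.

    Each signal corresponds to one nucleotide flow. A signal near 0 means
    no incorporation; a signal near n means a run of n identical bases.
    The flow order is an assumption made for illustration.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        run_length = int(round(signal))  # nearest whole number of bases
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Hypothetical normalized signals for eight flows (T, A, C, G, T, A, C, G)
example_signals = [1.02, 0.05, 2.10, 0.97, 0.03, 1.05, 0.00, 0.96]
print(decode_flowgram(example_signals))
```

Because the signal grows with the length of a run of identical bases, small errors in intensity can translate into miscounted bases in long runs, which is why the rounding step is the fragile part of this sketch.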
One of the popular and field tested technologies
in use to date is the one developed by Solexa. As
legend goes, a couple of British scientists met at a bar in
Cambridge over a pint of beer to chalk out a better
chemistry to sequence nucleotides in high throughput.
The informal summit at the Panton Arms, dubbed by many
the Beer Summit, was where the most popular next
generation technology was chalked out. Shankar
Balasubramanian and David Klenerman put together
their expertise in chemistry and laser detection to
develop the reversible terminator based sequencing
technology. The startup Solexa provided flesh to their
concepts, and the Genome Analyser, a commercial
bench top next generation sequencer, was born. The
basic technology can be summarized as follows. Short
pieces of DNA are captured on a solid glass surface
using small adapters, and these stretches are
amplified on the slide to produce clonal clusters of DNA
stretches.
These clonal clusters of single stranded DNA
are then used as templates for DNA
synthesis, cycle by cycle. In each cycle, a nucleotide
attached to a fluorophore is added. This addition is
recorded by imaging the slide. The fluorophore is
then removed, and the cycle is repeated for the entire
stretch of the DNA template. The series of recorded
images is then analyzed using
computers to reconstruct the sequence of the stretch of
DNA. The computer systematically goes through the
images, cycle by cycle, and reconstructs the order of
nucleotides from the fluorophore that lit up at each
particular cycle.
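The cycle-by-cycle reconstruction described above can be sketched in a few lines. This is a simplified illustration only: it assumes that the image from each cycle has already been reduced to four intensity values per cluster, one for each base, and it simply picks the brightest one. The function name `call_bases` and the intensity values are hypothetical, and real base callers apply further corrections that are not shown here.

```python
def call_bases(cycle_intensities):
    """Call one base per cycle for a single cluster.

    cycle_intensities: list of dicts mapping base -> fluorescence intensity,
    one dict per sequencing cycle, in the order the cycles were imaged.
    """
    read = []
    for intensities in cycle_intensities:
        # The brightest channel in this cycle is taken as the added base.
        read.append(max(intensities, key=intensities.get))
    return "".join(read)


# Hypothetical intensities for one cluster over four cycles
cluster = [
    {"A": 0.91, "C": 0.04, "G": 0.03, "T": 0.02},
    {"A": 0.05, "C": 0.02, "G": 0.88, "T": 0.05},
    {"A": 0.03, "C": 0.90, "G": 0.04, "T": 0.03},
    {"A": 0.02, "C": 0.03, "G": 0.05, "T": 0.90},
]
print(call_bases(cluster))
```

The length of the read equals the number of cycles imaged, which is why read length on this kind of platform is set by how many cycles are run.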
Figure 1. Overview of the Illumina NGS sequencing methodology
A number of other technologies and conceptual
methodologies also emerged in the later years. Of note
are technologies developed by Helicos BioSciences17,
Pacific Biosciences and Ion Torrent. The Helicos
sequencer was released in the year 2009 but did not
become popular. The company later filed for
bankruptcy, consigning the technology to oblivion. Ion
Torrent used a conceptually different technology, based
on measurement of pH changes on semiconductor chips.
The sequencer was released in the year 2011, and the
technology and product were acquired by Life Technologies.
Pacific Biosciences also released a commercial sequencer in
the year 2011, based on single molecule sequencing
chemistry without amplification. The technology has
many advantages compared to others, in that the single
molecule chemistry obviates the PCR bias incurred in
other sequencing methodologies and, in addition,
provides very long reads, sometimes extending to kilobases.
Such long reads have important applications, such as
detection of structural variations. Nevertheless, the
technology has not found widespread application in
regular clinical settings, but is quite popular among the
research community, especially laboratories working on
genomes that are difficult to assemble.
A number of newer technologies are presently on the
anvil and not yet available in the commercial space,
including nanopore sequencing, which uses protein
nanopores for detection of nucleotide bases.
17 Helicos BioSciences was co-founded in the year 2003, and its technology
imaged individual DNA molecules. It also featured a chemistry, dubbed
Virtual Terminator, which prevented incorporation of multiple
nucleotides in each cycle.
Figure 2. The Illumina HiSeq 2500 Next Generation Sequencer.
Courtesy: CSIR Institute of Genomics and Integrative Biology, Delhi.
Figure 3. The Ion Torrent Proton Sequencer based on
semiconductor chips. Courtesy: CSIR Institute of Genomics and
Integrative Biology, Delhi.
Chapter 5
When you could sequence your own genomes
Next generation sequencing was like a tsunami.
Though the early adopters of the technology saw its
huge potential, many of the traditionalists were quite
slow to realize its potential and its future. Sanger
sequencing was entrenched in many clinical laboratories
and was widely acclaimed for its reliability, quality and
ease of use, with automation being a standard. During
the early years, commercial next generation sequencing
platforms were fraught with frequent machine
downtimes and short read lengths, which practically
limited their applications, and usually had lower quality of
reads compared to traditional Sanger sequencing.
Nevertheless, these technologies provided the
much-needed throughput to bring whole genome
sequencing within a foreseeable trajectory.
The revolution in technological advancements and the
resultant scale and
throughput was phenomenal, so much so that at one point
the speed at which the sequencing technology improved
in terms of throughput and cost reduction was
comparable to Moore's law in the case of
microprocessors. The phenomenal fall in cost that
accompanied the increase in throughput is depicted in Figure 1.

Did you know?
Gordon E Moore, one of the co-founders of Intel,
predicted that the density of transistors in an integrated
circuit would double every two years. This is
commonly known as Moore's law.
Figure 1. The dwindling cost of whole genome sequencing over the
years. The X-axis denotes the timeline, and the Y-axis denotes the
costs in US$ on a logarithmic scale. Data from
http://www.genome.gov/sequencingcosts/ Retrieved Feb 04, 2015
What came next was the race to sequence human
genomes. The first of course were the stalwarts
themselves - Watson and Venter, who sequenced and
made available their personal genomes. What came out
of the sequencing was an astounding number of novel
variations, which had not been reported before. The
years that followed saw large genome centers shift
drastically to next generation sequencers and rapidly adapt
themselves to the avalanche of data. There were a few
new players also, notably the Beijing Genomics Institute,
which at one point in time was the largest genome facility,
with over a hundred next generation sequencers.
The rapid technological advancements during this
period led to a major paradigm shift, making genome
sequencing amenable to small research labs. For the first
time the power of genomics was being shared and
tasted by less well endowed laboratories, which did not
have the wherewithal to own and operate a large
inventory of sequencers and computing infrastructure, let
alone trained technicians and analysts.
Countries like India, which were not at the
forefront of technology during the initial human genome
sequencing initiative, were quick to adopt next
generation sequencing. What followed was a flurry of
human genome sequencing announcements from across
the world. The Chinese announced the Han Chinese
genome sequenced by the Beijing Genomics Institute,
while the Japanese announced the Japanese genome
and the Koreans announced the Korean genomes. India
was not far behind. The team from the CSIR funded
Institute of Genomics and Integrative Biology (CSIR-IGIB),
Delhi announced the first Indian genome. The
flurry of genome announcements continued: the
African genomes, Sri Lankan, Malaysian, Russian and so
forth. Those were exciting times! We would pore
over online announcements of sequenced genomes,
which were appearing almost every month, and
see if we could put them together to derive scientific
insights.
Being associated with the Indian genome
sequencing activity was a humbling experience. While it
taught us much of the nuts and bolts of genome
sequencing and analysis, it also provided immense
insights into how genome sequencing could be
applied in clinical practice. The costs of genome
sequencing were also dwindling drastically, and the
thousand dollar genome and its promises were widely
discussed.
While sequencing an individual whole human genome
would reveal approximately three million
variations, the computational pipelines and datasets
available for analysis can functionally annotate only a
small portion of these variations. This is primarily
because the functional annotation of variations is
dependent on computational methods that predict
whether a variation can change the sequence or
structure of a protein, and thereby its function. This
essentially means that the bulk of functional annotation
can be done only for variations that fall in protein
coding regions of the genome. This is detailed in the
next chapter. Having said this, it should also be
emphasized that methodologies to functionally
annotate and prioritize variations in regions of the
genome not coding for proteins also exist, though they
have not been widely adopted. Some of the early
methodologies for prioritizing variations in non-protein
coding genes have come out of our own laboratories. In
addition, a number of newer methodologies to annotate
functional variations in regulatory regions of the
genome also exist and have been widely used in the
literature.
Figure 2. The CIRCOS representation of the first Indian genome
announced in 2009 and the title page of the publication.
(Patowary et al. "Systematic analysis and functional annotation of
variations in the genome of an Indian individual." Human Mutation 33.7
(2012): 1133-1140.)
Chapter 6
So what if we could sequence just the protein coding genome?
The previous chapter discussed the limitations in
analysis of whole genome data. So the natural question
is: if the present methods of functional annotation
are largely limited to just the protein coding regions of
the genome, then why not sequence just this part? Such
an approach has the potential to significantly reduce the
cost of sequencing, ease the handling and analysis of data,
and possibly allow implementation in clinical practice to aid
diagnosis. This is popularly called exome sequencing.
An exome is defined as the protein-coding region
of a genome. In the human genome, the exome is
estimated to be approximately 1% of the genome, or
roughly 30 million bases. Since proteins form
the major workhorses of the cell that modulate
biological functions and outcomes, sequencing just the
protein coding region of the genome offers a quick, cost
effective solution to screen for genetic mutations.
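The arithmetic behind the "approximately 1%" figure, and why it matters for cost, can be worked through in a short sketch. The genome size of about 3.2 billion bases and the sequencing depths of 30x and 100x are assumptions chosen for illustration and are not figures from this book; the calculation is also idealized and ignores reads that fall outside the targeted regions.

```python
GENOME_SIZE_BP = 3_200_000_000  # assumed approximate human genome size
EXOME_SIZE_BP = 30_000_000      # approximate exome size quoted in the text


def exome_fraction(exome_bp=EXOME_SIZE_BP, genome_bp=GENOME_SIZE_BP):
    """Return the exome as a fraction of the genome."""
    return exome_bp / genome_bp


def bases_to_sequence(target_bp, mean_depth):
    """Bases needed to cover a target at a given mean depth (ideal case)."""
    return target_bp * mean_depth


print(f"Exome fraction: {exome_fraction():.2%}")

# Illustrative depths only: a genome at 30x versus an exome at 100x
ratio = bases_to_sequence(GENOME_SIZE_BP, 30) / bases_to_sequence(EXOME_SIZE_BP, 100)
print(f"Genome needs about {ratio:.0f} times the bases of the exome")
```

Even when the exome is sequenced to a much greater depth than the genome, the total amount of sequence required remains far smaller under these assumptions, which is the basis of the cost and data-handling advantage.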
A number of approaches have been developed to
extract and sequence just the protein-coding regions of
the genome. Three major approaches are popularly
employed to extract specific regions of the genome
(also known as targets) for sequencing.
One approach would be to amplify the specific
regions in question using standard polymerase chain
reaction (PCR). Usually, the reactions are multiplexed and
involve pools of primers that amplify selected regions of
the genome. The products of the
PCR can then be pooled together and sequenced.
This approach is widely used to amplify smaller regions
of the genome, but does not scale accurately to
larger targets, for example whole exomes, because
identifying optimum sets of PCR primers
with comparable efficiencies and high specificity is
challenging given the complexity of the human genome.
Figure 1. Conceptual outline of the gene structure with exons,
introns and the un-translated regions. The blue regions denote the
protein-coding regions, and the yellow regions denote the
untranslated regions. The transcript is spliced to form the
messenger RNA and then translated to functional protein.
Another popular approach has been the specific
capture of DNA corresponding to the regions
in question. This technique uses the
principle of specific base pair complementarity to
isolate specific regions of the genome. The capture
reaction involves pools of single stranded oligonucleotides
attached to solid surfaces, either beads or a glass
surface. These oligonucleotides are designed to
be complementary to the regions or targets that
need to be captured. Briefly, the genome is
fragmented using ultrasound or specific enzymes known
as restriction enzymes, which cut the DNA at specific
sites. This produces DNA fragments of
approximately comparable sizes. The strands are then
denatured and only fragments complementary
to the probes are isolated from the pool, thus
enriching for fragments that fall in protein coding regions
relative to the whole genome.
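The principle of capture by complementarity can be illustrated with a toy sketch. It models an idealized hybridization in which a probe binds a denatured fragment only where the fragment carries the exact reverse complement of the probe; actual capture reactions tolerate mismatches and use much longer probes. The function names and the short sequences are hypothetical and serve only to show the idea.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def capture(fragments, probes):
    """Keep fragments that contain a site complementary to any probe.

    Idealized model: exact matches only, no mismatch tolerance.
    """
    targets = [reverse_complement(p) for p in probes]
    return [f for f in fragments if any(t in f for t in targets)]


# Hypothetical toy data: one probe designed against a target site
probes = ["GGATCCAT"]  # binds fragments that contain ATGGATCC
fragments = ["TTATGGATCCGA", "CCCCAAAATTTT", "GATGGATCCTTA"]
print(capture(fragments, probes))
```

Fragments without a complementary site are simply not retained, which is how the pool becomes enriched for the targeted regions.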
The targets are then processed for whole exome
sequencing following standard protocols. An overview
of the two popular approaches to enrich for protein
coding regions in the genome is summarized in Figure 2.
Though the approach seems simple and
logical, exome sequencing also has its share of
limitations. The first limitation is that it by design
excludes genomic variations falling outside of protein
coding regions, many of which are functional. The best
examples are promoter variations, which change
expression of specific genes, and regulatory variations in
the untranslated regions, which are known to modulate
expression of genes and stability of transcripts.
Figure 2. Conceptual outline of the two popular methodologies for
capturing specific regions in the genome. The first methodology
involves capture probes immobilized on solid surfaces, while the
second approach involves probes immobilized on beads.
Figure 3. Conceptual overview of the major steps in primary
analysis pipeline, which involves sequence quality check, alignment
of high quality reads to the reference genome.
The second major caveat of exome sequencing is
that specific types of variations cannot be accurately
typed. The best example is chromosomal
abnormalities, especially when there is no net change in
copy number, as in
translocations and inversions. Since the capture
methodology enriches specific stretches of the genome
without keeping the context of the genomic region each
came from, it would be impossible to decipher such
events unless the breakpoint occurs within a protein
coding region, as in the case of the well-studied
PML-RARα translocation in leukemia. Though new
computational tools enable the characterization of copy
numbers from exome sequencing data, it should be
emphasized that exome sequencing is still not the most
accurate methodology to look for chromosomal
abnormalities, including those that involve a copy number
change.
These limitations aside, sequencing just the
protein coding part of the genome has its advantages.
The first is the cost, which is significantly lower than that
of whole genome sequencing. The second is the
relatively small amount of data, which is easier to
handle and less complex to analyze, without reliance on
the huge computing infrastructure required to analyze
whole human genomes. The third advantage is the ready
availability of methods and tools to systematically
analyze the data, including online resources, which makes
analysis and interpretation a bit easier for clinicians.
Chapter 7
When should you do exome sequencing?

So the obvious next question would be: when
should I do exome sequencing?
Let us go back to the case of Bhai, which illustrates the
first scenario. The molecular
diagnosis and confirmation of the disease would require
sequencing of approximately 20 amplicons using the Sanger
sequencing approach in a traditional diagnostic setup.
Standardizing the PCR amplicons and performing the
sequencing is a tedious, time-consuming and sometimes
expensive proposition, which makes the accurate
molecular characterization of many diseases a challenge.
The second is a scenario where there are a number of
differential diagnoses. There are many examples of such
cases in regular clinical settings. In such situations, the
accurate molecular characterization and diagnosis of the
disease would require sequencing of multiple loci and
genes, which in several settings, as in the previous
situation, might become tedious, time-consuming and
expensive.

Advantages of whole exome sequencing in clinical settings
1. Fast: 1-4 weeks turnaround time
2. Holistic, as it covers the majority of known disease
causing gene loci
3. Cheaper in specific situations
Exome sequencing is a new alternative
approach in such scenarios for a number of reasons.
Exome sequencing is quite fast, with commercial
turnarounds in the range of weeks. The
approach is holistic, in the sense that it covers a majority
of genes involved in Mendelian diseases. In addition, in
many cases that involve a number of genes or exons
for confirmatory diagnosis, it might be cheaper than
traditional approaches.
The third scenario is where there is no diagnosis and
the presentation is quite rare, or there are multiple
affected family members, or the situation involves
consanguinity. After exclusion of chromosomal
abnormalities and structural variations, exome
sequencing might be a useful approach to follow
in such situations.
The fourth and probably the commonest case where
exome sequencing is warranted is when a definitive
clinical diagnosis has been made, but the specific variant or
variants known to be associated with the disease are
reported unaltered. This would hint towards the
involvement of a novel variant or a new gene locus, which
would benefit significantly from a holistic approach like
exome sequencing.
The fifth situation where exome sequencing would
be extremely beneficial is in cases where a specific
molecular diagnosis is expensive or not
available in the local situation or country, or in
cases where the timelines for diagnosis would not be
met by a conventional approach. Exome sequencing in
such cases would be useful on the economic front as
well as on grounds of speed and efficiency.
The sixth scenario is in the case of undiagnosed
diseases with a clear or suggestive genetic
cause. A number of international studies
have suggested that whole exome
sequencing would be a useful proposition to
arrive at a definitive diagnosis in cases of
undiagnosed diseases. Specific programs and
studies have used exome sequencing extensively
to diagnose undiagnosed or
rare diseases. These have provided insights
and a diagnosis for a significant number of
cases in their cohorts.
There are a number
of research settings where exome sequencing would
be of significant benefit. These are especially cases of
genetic diseases with atypical
presentations, or with features additional to those of
otherwise clinically diagnosed conditions, where the
possibility of finding novel variants and novel loci exists.

The GUaRDIAN Consortium
GUaRDIAN stands for Genomics for Understanding
Rare Diseases-India Alliance Network. It is a consortium
and network of clinicians, clinical geneticists and
genomics researchers formed with the aim of using
the power of genomics to understand the molecular
basis of rare genetic diseases.
More information on the consortium and how it could
help you is available online at
URL: http://guardian.meragenome.com
The other research application of exome sequencing
in clinical settings is in understanding the genetic basis
of rare genetic diseases. A number of recent studies
have shown that exome sequencing and whole genome
sequencing are appropriate genomic tools
for the molecular dissection of rare genetic diseases and
the discovery of novel mechanisms and gene loci involved
in them.
Figure 1. The quadrant where the optimum use of whole exome and
genome sequencing is recommended.
In addition, as depicted in Figure 1, exome
sequencing has found its place in the discovery
of rare mutations with large effect sizes and of genetic loci
associated with common diseases. Exome sequencing
has recently been used extensively to discover rare
variants associated with common diseases. This has
been largely possible by sequencing individuals at the ends
of the spectrum. A number of recent reports have
shown that this approach is powerful and could provide
a new opportunity to understand genetic variants with
large effect sizes.
Chapter 8
When should you probably not do exome sequencing?
It should be noted that exome sequencing is not
a magic bullet that can enable diagnosis of all genetic
diseases; nevertheless it should be considered a new
technological advancement that can provide valuable
insights to aid the diagnosis of the majority of
genetic diseases. Exome sequencing is not without
caveats. These limitations should be clearly understood
so that the expectations from whole exome sequencing
remain realistic.
The major caveat is that the approach can
only identify variations in protein coding regions of
genes. A number of genetic diseases are known to be
caused by mutations in non-protein coding regions,
including non-coding RNAs. Most of the newer exome
sequencing panels do include untranslated regions,
promoters and in some cases non-coding RNA genes, but it
should also be noted that many diseases are caused by
variations in the introns and splice junctions. These
might not be captured in a typical exome capture panel.
So a clear understanding and an informed decision are
warranted before selecting exome sequencing as a method of
diagnosis for such diseases.
Contrary to expectations, not all genes are
captured in typical exome sequencing. A number of
exons that are rich in Gs and Cs, or that lie in
repeat-rich regions, cannot be
accurately captured and resolved by exome sequencing.
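One way to anticipate which exons may be poorly captured is to compute their GC content beforehand. The sketch below is a minimal illustration; the function names, the example sequences and the threshold of 0.65 are assumptions made for this example and do not represent a validated cut-off for any capture kit.

```python
def gc_fraction(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def flag_gc_rich(exons, threshold=0.65):
    """Return names of exons whose GC fraction exceeds the threshold.

    The default threshold is an illustrative assumption only.
    """
    return [name for name, seq in exons.items() if gc_fraction(seq) > threshold]


# Hypothetical exon sequences
exons = {
    "exon_1": "GCGCGGCCGCGGGCCGCA",
    "exon_2": "ATGAAATTTGCATTAGAT",
}
print(flag_gc_rich(exons))
```

A list of flagged exons in the genes of interest can help decide whether a negative exome result in those regions should be followed up with another method.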
Exome sequencing is not useful in diseases
associated with chromosomal abnormalities and
structural variations in the chromosomes (with very few
exceptions). A large number of syndromes involve large
chromosomal abnormalities, including copy number and
structural abnormalities. The capture methodology
precludes the identification of such chromosomal
abnormalities, especially ones that are not associated
with a net change in copy number. The
exceptions are rare, and mainly involve cases where
the breakpoint lies within a protein-coding gene.
Though standard pipelines for exome analysis are built
to analyze single nucleotide variations and insertion
deletion events, newer and specialized pipelines are
presently available to detect copy number changes in
chromosomes and breakpoints. It should be noted that
such analysis is still in the research domain and has not
been extensively applied in clinical settings.
A number of diseases are caused by repeat
expansions. The best-studied examples include
Huntington's disease and some spinocerebellar ataxias.
The exome sequencing approach is not very effective in
diagnosing such diseases. This limitation primarily arises
from the fact that most next generation sequencers
produce short reads, which cannot accurately resolve
repeats, especially simple repeats.
A number of diseases are caused by mutations in
mitochondrial genes, which show a unique feature
called heteroplasmy: mitochondrial genomes
carrying different variations are present in the same cell.
Standard exome capture and analysis methodologies
largely ignore the mitochondrial genome, though some
capture methodologies also systematically capture
mitochondrial variations. In addition, pipelines for
analysis of mitochondrial variations are also available. If
you suspect a mitochondrial disease and a maternal
pattern of inheritance, it would be worthwhile to start
with mitochondrial sequencing. It should also be mentioned
that not all mitochondrial abnormalities
are caused by variations in the mitochondrial genome. The
products of a number of nuclear genes are imported into
the mitochondria, and
mutations in these genes could also manifest as
mitochondrial abnormalities, albeit with a
Mendelian pattern of inheritance.
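Heteroplasmy is usually expressed as the fraction of sequencing reads at a mitochondrial position that carry the variant allele. The sketch below shows this calculation; the function names, the read counts and the 5% and 95% bounds are assumptions for illustration, and suitable bounds would depend on read depth and sequencing error rates.

```python
def heteroplasmy_fraction(ref_reads, alt_reads):
    """Fraction of reads at a mitochondrial site carrying the variant allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    return alt_reads / total


def classify_site(ref_reads, alt_reads, low=0.05, high=0.95):
    """Label a site as reference, heteroplasmic or homoplasmic variant.

    The bounds are illustrative assumptions, not recommended values.
    """
    fraction = heteroplasmy_fraction(ref_reads, alt_reads)
    if fraction < low:
        return "reference"
    if fraction > high:
        return "homoplasmic variant"
    return "heteroplasmic"


# Hypothetical read counts at three mitochondrial positions
print(classify_site(980, 20))
print(classify_site(600, 400))
print(classify_site(5, 995))
```

This differs from nuclear variants, where the variant fraction is expected to be close to 50% or 100%, and it is one reason mitochondrial variants are often handled by a separate pipeline.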
A handful of rare diseases are caused by uniparental
disomy. Usually two copies of each chromosome
are inherited, one from each parent. In some situations,
both copies are inherited from the same
parent. Typical exome sequencing would not be able to
identify whether a mutation came from one parent or
from both.
Chapter 9
First things first: putting insights before data
"Chance favors a prepared mind" -Max Perutz
The diagnosis of a disease is only as good as the
clinical work up you have done on the patient. Before
prescribing exome sequencing, you should have your
options set and know exactly what your expectations
are. Exome sequencing is not a panacea for all the
limitations of genetic diagnosis.

Before you decide on exome sequencing, collect the
following information
1. Complete detailed family history and pedigree
2. Complete list of clinical phenotypes and results of
clinical investigations
3. A complete list of differential diagnoses

A complete family history and pedigree
Let us come back again to the case of
Bhai. In the initial conversations with the
primary physician and Bhai himself, the only
information that could be gleaned was that
only members of his
immediate family and close relatives were affected.
Multiple encounters and a close study of his distant
family tree over multiple visits and trips revealed that
the disease was running in a much larger family,
scattered over cities. Multiple coordinated attempts put
together the comprehensive family tree, and it was
revealed that the disease had been running in the family
for generations and involved more than a dozen
affected members. The family still remains the largest
reported family affected with Epidermolysis Bullosa in
India.
There is nothing better than a detailed family
history and a pedigree to help clinch a clue and
assist to a great extent in arriving at the right diagnosis.
The index case or parents might not be quite
forthcoming on the family history, or in many cases
might not be aware of the family history of the disease.
It would be worthwhile to spend some time closely with
the patient or other members of the family and collect
detailed information on all the relatives around them,
including their health status, diseases, medications,
deaths and causes of death, miscarriages, abortions,
stillbirths and deaths in the early neonatal period and
childhood.
Consanguinity18 is another key question. In many
cases the family might not be quite forthcoming about
consanguinity, as it is sometimes the norm in many
communities. In many cases all the relevant information
cannot be gathered in a single sitting, as the patients or
parents might not be aware of or might not recollect the
facts. So it would be useful to gather the details
over multiple interactions. If the patient or parents are
not educated, it would also sometimes be necessary to
ask pointed, but not suggestive, questions regarding the
diseases, deaths and causes thereof in the family.
18 Consanguinity means shared kinship or blood relation.
A good detailed pedigree permits
hypothesizing the mode of inheritance of the disease,
which would be extremely useful to prioritize variations
in the exome data. For example, the family tree of Bhai
helped us clinch a diagnosis of autosomal dominant
Epidermolysis Bullosa. The genetic variant is thus
expected in the heterozygous state in the exome, which
meant that candidate variants could be prioritized easily by
sequencing two affected members of the family. A
detailed chapter on prioritizing variants is available in the
later part of this book. Similarly, a consanguineous
marriage would suggest the possibility of a recessive19
disease and also suggests mapping of regions of
homozygosity (this is described in the later chapters as a
methodology to prioritize variations after exome
sequencing). The occurrence of disease in multiple
individuals in an outbred family suggests the possibility of
an autosomal dominant presentation, while a disease
passed on through the maternal lineage over
generations would suggest a mitochondrial mode of
inheritance.
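The way a hypothesized mode of inheritance narrows down candidate variants can be sketched as a simple filter. This is an illustration under stated assumptions: genotypes are written as "0/0", "0/1" and "1/1" for reference, heterozygous and homozygous variant calls, the disease is assumed to be fully penetrant, and the sample names, variant identifiers and function names are hypothetical.

```python
def shared_heterozygous(genotypes, affected, unaffected=()):
    """Variants heterozygous in all affected and absent in all unaffected.

    genotypes: dict mapping variant id -> dict of sample -> genotype.
    This corresponds to an autosomal dominant hypothesis.
    """
    candidates = []
    for variant, calls in genotypes.items():
        in_all_affected = all(calls.get(s) == "0/1" for s in affected)
        in_no_unaffected = all(calls.get(s) == "0/0" for s in unaffected)
        if in_all_affected and in_no_unaffected:
            candidates.append(variant)
    return candidates


def homozygous_in_affected(genotypes, affected):
    """Variants homozygous in all affected (a recessive hypothesis)."""
    return [v for v, calls in genotypes.items()
            if all(calls.get(s) == "1/1" for s in affected)]


# Hypothetical genotype calls for two affected and one unaffected member
genotypes = {
    "var_1": {"affected_1": "0/1", "affected_2": "0/1", "unaffected_1": "0/0"},
    "var_2": {"affected_1": "0/1", "affected_2": "0/0", "unaffected_1": "0/0"},
    "var_3": {"affected_1": "0/1", "affected_2": "0/1", "unaffected_1": "0/1"},
    "var_4": {"affected_1": "1/1", "affected_2": "1/1", "unaffected_1": "0/1"},
}
affected = ["affected_1", "affected_2"]
print(shared_heterozygous(genotypes, affected, ["unaffected_1"]))
print(homozygous_in_affected(genotypes, affected))
```

Each additional family member sequenced removes more variants from the candidate list, which is why an accurate pedigree has such a direct effect on the analysis.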
A complete list of clinical phenotypes and clinical
investigations
Apart from the detailed pedigree, a thorough
clinical examination and enumeration of the clinical
findings is an important aspect that should not be
overlooked. In cases of clinical presentations like facial
dysmorphology20 or skin abnormalities, a detailed
description of the findings is necessary. It would also be
worthwhile to have clinical photographs of the features
to avoid ambiguity and to enable other clinicians or
clinical geneticists to arrive at an independent conclusion.
In the case of patients with diseases manifesting
with abnormalities in the levels of metabolites in
the blood, a detailed investigation towards
this end is also an essential clinical
activity to be undertaken.
19 Both copies of the gene would need to be mutated to manifest an
autosomal recessive disease.
20 Dysmorphology is the study of birth defects, especially involving the
morphology of the body.

Did you know?
The Online Mendelian Inheritance in Man (OMIM)
database is a comprehensive online database of human
genes and disease phenotypes.
The work on collecting Mendelian diseases and traits
was originally initiated by Dr. Victor A. McKusick in the
1960s and was available initially as a book. The electronic
version of the compendium was made available online in the
present form from 1995 through the National Center
for Biotechnology Information.
The present OMIM is curated and maintained by the
McKusick-Nathans Institute of Genetic Medicine, Johns
Hopkins University School of Medicine, USA.

A complete list of differential diagnoses
The clinical findings and investigation reports,
together with the detailed pedigree, form the basic set of
clues enabling one to arrive at a set of
differential diagnoses. It would be worthwhile
to list a detailed set of differential
diagnoses before one prescribes exome
sequencing in clinical settings. This would
enable the
prioritization of genes to be closely examined. Apart
from the list of differential diagnoses, a list of genes that
are involved in the disease also comes in handy while
analyzing the exome data. A list of potential genes
involved could be garnered from the Online Mendelian
Inheritance in Man (OMIM) database. Furthermore,
variants in these genes and their pathogenic effects
could be garnered from a number of locus specific
variation databases.
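The use of a candidate gene list while analyzing exome data can be sketched as a simple partition of the variants. The gene symbols below are placeholders rather than real genes, and the function name and data layout are assumptions for illustration; in practice the list would be compiled from the differential diagnoses.

```python
def prioritize_by_gene(variants, candidate_genes):
    """Split variants into those inside and outside a candidate gene list.

    variants: list of dicts with at least a "gene" key.
    candidate_genes: gene symbols drawn from the differential diagnoses.
    """
    wanted = {g.upper() for g in candidate_genes}
    in_list = [v for v in variants if v["gene"].upper() in wanted]
    others = [v for v in variants if v["gene"].upper() not in wanted]
    return in_list, others


# Placeholder gene symbols and variant identifiers
variants = [
    {"id": "var_1", "gene": "GENE_A"},
    {"id": "var_2", "gene": "GENE_X"},
    {"id": "var_3", "gene": "gene_b"},
]
first_pass, remainder = prioritize_by_gene(variants, ["GENE_A", "GENE_B"])
print([v["id"] for v in first_pass])
print([v["id"] for v in remainder])
```

Keeping the remaining variants rather than discarding them allows the analysis to be widened later if nothing convincing is found in the candidate genes.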
Chapter 10
Educating the patient and getting an informed consent
Before prescribing exome sequencing, it is
imperative to explain the entire method, its benefits and
pitfalls. It is also imperative to inform the patient about
the potential risks of uncovering unanticipated facts
from the exome sequencing,
including risks of late onset diseases, cancers and
sometimes paternity. It would therefore be essential to
take both parents into confidence before exome
sequencing is prescribed. A detailed information sheet
that explains a non-exhaustive set of circumstances
or scenarios is appended at the end of the book.
The following major points need to be specifically discussed with the
patient before exome sequencing.
1) Samples collected: The patient needs to be informed how the samples
would be collected (saliva, blood) and what amount of sample would be
collected.
2) The analysis performed on the samples also needs to be explained to
the patient. If any additional genetic/epigenetic/biochemical tests are
required to be performed on the sample, this needs to be mentioned, along
with how such a test would help in reaching the diagnosis.
3) Use of data and release: The patient needs to be informed whether the
data would also be used for research and whether it would be released in
a public database at any time. The benefits and risks of public release
also need to be discussed.
4) Risks and discomforts: The risks and discomfort due to the methodology
of sample collection, or of having the exome sequence available, should
also be explained in detail. A few scenarios are explained below.
a. The availability of the sequence could put one in precarious
situations, including identification of an individual, inference of
paternity, inference of specific features of the genealogy and possible
prediction of risks to self and children, and sometimes to other close
relatives in the family.
b. The information on the exome could potentially be leaked from multiple
sources, electronic or otherwise, which might have implications for the
person and the family.
c. If a previous genetic screen has been performed for research or
diagnosis, the exome sequencing would make you identifiable in such a
situation.

Did you know?
Manuel Corpas, a researcher, made available the genomes of himself and
his family in a freely available and re-usable format on the internet,
with the hope that people could download the data, analyze it and obtain
new insights on the genome. This was popularly called the Corpasome. Such
an approach could potentially make the genome analysis and derivative
information up-to-date and comprehensive at any point in time, with
enormous benefits in understanding disease predispositions and/or
prognosis.
The paper describing the dataset was published with the following
citation:
Source Code for Biology and Medicine 2013, 8:13
doi:10.1186/1751-0473-8-13
http://www.scfbm.org/content/8/1/13

5) Anonymity and privacy: The patient should be educated about the
benefits and risks of being anonymous, and the potential advantages of
being non-anonymous. Specific case scenarios of data being publicly
released, as in the case of the Corpasome, could be discussed. If the
patient wishes to remain anonymous, the methodologies and measures
whereby anonymity would be maintained in the specific clinical setting
need to be detailed to the patient. The patient should also be educated
that privacy and anonymity are not inter-dependent entities, and that
modern technologies could maintain anonymity and privacy while still
benefiting from public release of the data. A recent paper from our
laboratory details this concept21.
6) Masking results: The patient could be asked for a list of conditions
or types of conditions which might cause discomfort and which need not be
screened for in the genetic data generated. Nevertheless, it also needs
to be discussed with the patient whether any of the diseases which would
benefit from reporting and are part of the ACMG guidelines (detailed in a
later chapter) should also be excluded from the analysis or reporting.

21 "Personal genomes, participatory genomics and the anonymity-privacy
conundrum." Journal of Genetics (in press), available at URL:
http://link.springer.com/article/10.1007/s12041-014-0451-3

The detailed consent form provided to patients and other participants as
part of the GUaRDIAN consortium is enclosed below and would serve as a
ready reference guide.
RESEARCH CONSENT FORM

Reference Code:
Son/daughter/wife of ........ aged ........
Residing at
Hereby consent to freely participate in the genetic study aimed at
understanding the human genome. I have been informed about the
implications of my personal genome data being made publicly available
through public databases as well as scientific communications. I have been
advised to discuss my participation in this study with my family members. I
have been provided written information that may be circulated to them, if
necessary. I have been further informed that personal and medical data
collected during this study will be associated with my publicly available
genome and may be used for scientific analysis. My participation in this
study is entirely voluntary and I am free to withdraw from this study as and
when I feel so inclined.
1. I choose to disclose / not to disclose my identity (select one option).
2. I choose to be / not to be informed of the results of the analysis that may
impact my health (applicable only to those who have chosen to disclose
their identity; select one option).
3. I choose to exclude the information attached on the "Exclusion Form"
from analysis / public disclosure (Applicable only to those who have chosen
to disclose their identity).
(Signature/ Thumb impression of volunteer)
(Date)
Certified that the above consent has been signed in my presence. The
purpose for which the sample will be used has been explained to the above
volunteer. The individual is free to withdraw from the study as and when
he/she feels so inclined.
(Signature of the investigator)
(Date)
Exclusion Form
I choose to exclude the following information from the questionnaire with
respect to analysis or public disclosure (please indicate the relevant
question numbers from the attached questionnaire)
1. Analysis:
2. Public disclosure:
INFORMATION FOR THE VOLUNTEERS
1. Purpose of study
The principal scientific goal of this study is to explore avenues to study
genetic variability between individuals and to correlate the variability to
the phenotypes. The data generated (i.e., human DNA sequence, medical
information and physical traits) may be used for scientific and clinical
research such as development of computational tools and interfaces for
scientists, clinicians and individuals, in addition to developing general
public awareness on potential benefits and risks of having whole genome
level information available to the public.
2. Enrolment procedures
A. Collection of baseline trait data:
You are required to provide baseline trait data about yourself,
including: date of birth, medications, allergies, vaccines, personal
and family medical history, race/ethnicity/ancestry and vital signs
(e.g. height, weight, blood pressure etc.) in the attached
questionnaire.
B. Monozygotic twin:
If you have any identical twin(s), such sibling(s) will need to
provide consent for your participation in this research.
3. Tissue (Blood/Saliva) collection
A. Blood sample will be collected from the upper arm by
venipuncture. Twenty-five ml of blood sample will be drawn by an
authorized medical or an authorized technician under the
supervision of an authorized medical doctor, in the presence of
the principal investigator. Fresh blood sample will be collected in
designated containers (which will be provided by CSIR/IGIB).
Serum would be isolated from the collected blood sample for
biochemical analysis.
B. Saliva sample will be collected by voluntary spitting. Two to
four ml of saliva will be collected in designated containers (which
will be provided by CSIR/IGIB).
4. Genomic analysis
Analysis of DNA/RNA, including but not limited to whole genome sequencing
and other biochemical analysis, will be performed on tissue samples
collected from the individual. The nature and extent of analysis will be
determined by CSIR/IGIB at its sole discretion.
5. Public release of research data
Upon completion of genomics analysis, your DNA sequence data will be
made available through the CSIR/IGIB website and other scientific
communications (including but not limited to publication in scientific
journals). This information is for research purpose only and may not be used
by you for any medical or clinical purpose unless the relevant research data
(DNA sequence) is first confirmed and discussed in consultation with a
health care professional. By signing this consent form, you hereby agree
and authorize CSIR/IGIB to proceed with the full public release of your
DNA/RNA sequence data and other information (date of birth, medications,
allergies, vaccines, personal and family medical history,
race/ethnicity/ancestry and vital signs) voluntarily made available by you,
without any legal restriction and without your further consent, through the
CSIR/IGIB website and database or other formats of standard scientific
communications (including but not limited to publication in scientific
journals), and you hereby acknowledge the risk associated with the public
release of such data and information. Your identity will be held
confidential if you choose, even though the identity-stripped information
would be publicly available.
6. Risks and discomforts
A. Venipuncture: This procedure is associated with minimal
discomfort and is free of significant adverse effects.
B. Data analysis: You are strongly advised to discuss this study and
the potential risks, as outlined below, with your parents, siblings
and descendants (hereinafter family members), as well as your
health care provider(s). You are also advised to directly discuss
any additional concerns with the Principal Investigator.
The following is a non-comprehensive list of hypothetical scenarios that
could pose a risk for you and your family members:
i) The data provided by you (such as traits and vital signs or DNA
sequence data) may be used to identify you, resulting in higher
than normal levels of contact from the press and other members
of the public. This could result in a loss of privacy and personal
time.
ii) Anyone with sufficient knowledge and resources could take
your DNA sequence data and/or your personal trait information
and utilize the data, with or without modification, to (1) infer
paternity or other features of your genealogy, (2) reveal the
possibility of a disease or risk for a disease. Such information could
lead to social and financial consequences including but not limited
to employment and insurance.
iii) Your family members could also be subject to discrimination for
employment, insurance or financial service on the basis of the
public disclosure of your genetic and trait information.
iv) If you have previously made or plan to make available genetic
information in a confidential setting, the data provided by you as
part of this study may reveal your identity.
v) Any conclusions derived from the publicly available information
may be speculative with respect to you and even less predictive
with respect to your family members. The complete set of risks
posed to you and your family members due to the public release
of the DNA sequence and trait data is not known at this time. We
encourage you to discuss this aspect with your family members.
7. Benefits
(i). At present there are no proven benefits to you for your participation in
this study.
(ii). This study may benefit the medical and research community in
particular, and humanity in general, and may help in establishing genetic
causes and predisposition for common diseases.
(iii). You may experience satisfaction from participating in research that
may benefit medical science.
8. Intellectual property rights and benefit sharing
You will not be financially compensated for your participation in this study.
Neither you nor your heirs shall claim from CSIR/IGIB any financial benefits
or rights, for any information, data, discoveries, whether or not of a
commercial nature, made using the information generated in this study.
However, as per international (HUGO, UNESCO) and national guidelines
(National Bioethical Committee, Ethical Guidelines for Biomedical Research
on Human Participants), it is necessary for national/international entities
deriving economic benefit out of the knowledge resulting from the use of
human genetic material to dedicate a percentage (e.g. 1%-3%) of their
annual profit for the benefit of the community/public health.
9. Confidentiality
The results of this study may be published in a medical book, journal,
website or webpage or used for teaching purpose. Your name and other
identifiers will be disclosed only if you have consented to disclosure of
your identity. You may not be notified by CSIR/IGIB prior to such use.
10. Withdrawal of participation
Participation in this study is voluntary. You may withdraw your participation
and/or your data from this study at any time, as described in the consent
form. However, once the DNA sequence and associated information is in the
public domain it is likely to get disseminated widely and rapidly. Therefore
it may not be possible to retract the data in response to a withdrawal
request.
Chapter 11
Points to note when you outsource exome sequencing
A large number of commercial enterprises now provide whole exome
sequencing as a service. As stated before, there are a large number of
competing capture methodologies and sequencing technologies, which makes
the choice of an appropriate technology cumbersome and sometimes
extremely challenging. The challenges aside, there are a few questions
that need to be kept in mind before outsourcing exome sequencing in
clinical settings. This section is designed to provide a basic guideline
on specific points that are to be considered, and not as a guide to
selecting a particular methodology or technology.
The capture methodology and capture efficiencies
As mentioned before, it is a good idea to take note of the target genes
and exons captured, as there are a number of capture methodologies that
differ in the number of bases captured in the genome and in their
efficiency of capture. This is important in the context of patients with
known genetic diseases, where you are keen to look for a known variant or
variants to confirm the diagnosis. It is important to make sure that the
genes and specific exons of interest are covered efficiently by the
capture methodology in question. The capture efficiency of the target
region should also be checked after the sequencing is done. Details of how
to go about this are given in the later chapter on data analysis.
Sequencing technology, quality of reads and data throughput
A number of sequencing technologies are available in the commercial
space. It is therefore important to take note of the sequencing
technology employed before you finalize the methodology. A rule of thumb
is to go with a methodology that would provide an ample number of high
quality reads at an affordable cost. How to evaluate this after the
sequencing is performed is detailed in the later chapter.
Depth coverage of the target regions
In a regular clinical setting, for diagnosis of rare genetic diseases, it
would be worthwhile to have at least 100x coverage of the exome. This is
because capture efficiencies are variable across the genome, and an
average coverage of 100x would, in practical situations, leave almost all
target regions adequately covered to enable variant calling. It is also
imperative to check what percentage of the target region has sufficient
coverage to enable accurate variant calling.
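The two numbers discussed above, the average depth and the percentage of
the target region that is adequately covered, can be computed from
per-base depths. The following is only a minimal sketch: the function
name, the toy depth values and the 20x cut-off for "adequately covered"
are illustrative assumptions and are not taken from this handbook or from
any particular service provider.

```python
from typing import Dict, Sequence


def coverage_summary(depths: Sequence[int], min_depth: int = 20) -> Dict[str, float]:
    """Summarise per-base read depths over the target region.

    depths    : one depth value per targeted base (hypothetical input)
    min_depth : assumed threshold for an 'adequately covered' base
    """
    if not depths:
        raise ValueError("no target positions supplied")
    total = len(depths)
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        "mean_depth": sum(depths) / total,
        "pct_at_min_depth": 100.0 * covered / total,
    }


# Toy example: eight targeted bases with very uneven capture.
example_depths = [120, 95, 0, 18, 150, 101, 33, 60]
summary = coverage_summary(example_depths, min_depth=20)
print(summary)
```

A high mean depth can hide poorly captured exons, which is why the
percentage of bases above the threshold is worth asking for in addition
to the average.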
Availability of raw data and alignments
While outsourcing exome sequencing, one should also insist that the raw
data with qualities (preferably in FASTQ format) and the alignments be
made available. This is an important consideration for a number of
reasons. The first and prime reason is that the field is still young, and
so are the methodologies for analysis. Apart from the information on the
particular variant in question, the exome also contains a number of other
variants, many of which could give additional insights and have clinical
implications. Secondly, in many cases, it is necessary to go back to the
data and reanalyze it at a later point in time to arrive at an appropriate
diagnosis in light of disease progression and new clinical findings.
Variant calls, formats and interoperability
A number of service organizations offer the variants in custom formats,
usually tab-delimited files or even Excel sheets. One should insist that
all variant calls be made available in a standard, interoperable format.
The commonly employed standard format for variant calls is the VCF
format. The VCF22 format includes all the information necessary to
reanalyze the variants for prioritization, especially the read coverage
around the variant, the variant quality and, in the case of trios, the
samples that carry the particular variant. Additionally, VCF files are
interoperable and are accepted by most online resources and software that
aid the analysis of exome datasets.
Details of the analysis pipeline with parameters
The results of an exome sequencing analysis could vary drastically
depending on the analysis pipeline employed, and especially on the
parameters used for sequence alignment and variant calling. To ensure
that the data is reliable and reproducible, it is imperative that the
report has an accurate description of the analysis pipeline as well as
the parameters used in alignment and variant calling.

22 VCF stands for Variant Call Format. This format was developed as part
of the 1000 Genomes project and is widely used in the community. A number
of bioinformatics tools and resources for analyzing variant data take
variant data input as VCF files.

Datasets used for annotation, versions and updating
Just as the analysis tools and parameters affect the variant calls, the
datasets used and their versions also have a large impact on the
conclusions derived. Many of the datasets of genomes, genes and variants
are regularly updated and have non-trivial changes between released
versions. It is thus important to keep a note of the versions of the
databases used, so that the results could be appropriately interpreted
and the analysis appropriately reproduced.
Chapter 12
Understanding the steps in analysis of exome sequence data
The major steps in the analysis of exome sequence data could be
summarized as follows. The first step is a quality check of the data. The
second step is the alignment of the sequence reads to the reference
genome. The third step is the analysis of the alignment to call variants,
and the fourth step is to annotate and analyze the variants. The steps
involved in the entire process are summarized in Figure 1.
Figure 1. Summary of steps involved in the analysis of the exome.
The nucleotide data generated by the sequencer is usually available in a
file format known as FASTQ (which stands for FASTA with Qualities). As you
would have imagined, the file contains sequences with their base
qualities. The base quality is important to note here, because it tells
how good the sequence read is, and only a good quality read would provide
you with a good quality variant for further analysis. FASTQ files are
quite large, and in most cases cannot be opened in your word processor or
text editor. Nevertheless, it is worthwhile understanding what the file
contains and what it means. The FASTQ file has 4 lines corresponding to
each read, and there could be millions of such reads in the file,
arranged one after another. Briefly, the first line starts with an @
followed by information on the read. This usually includes the sequencer,
the run name and the date, and might not be of use to you in a regular
case. The second line contains a string of A, T, G and C characters,
which is the nucleotide sequence of the read. The third line starts with
a + and in some cases repeats the information in the first line, while
in other cases it is otherwise empty, to avoid redundancy. The fourth
line contains characters that read like gibberish; this is the
representation of the quality of each base in the read. The number of
characters is therefore exactly the same in the second and the fourth
lines, as there is a quality representation for each base. The gibberish
is nothing but the ASCII character23 equivalent of the quality score.

23 ASCII stands for American Standard Code for Information Interchange;
it assigns a numerical code to each character.

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCCTTGGCAGGCCAAGGCCGATGGATCA
+
;;3;;;;;;;;;;;;7;;;;;;;88;;;;;;;;;;;9;7;;.7;393333
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA GTTGCTTCTGGCGTGGGTGGGGGG
+
;;;;;;;;;;;7;;;;;-;;;3;83;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGGCCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333;;;;;;;;;;;7;;;;;-;;;3;83
Figure 2. The FASTQ file format with sequences of the reads and
qualities of bases in the sequence read.
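The four-line layout described above is simple enough to read with a few
lines of code. The sketch below is illustrative only: the function and
class names are hypothetical, the two demonstration reads are invented,
and real FASTQ files are usually compressed and far too large to hold in
memory, which is why the reader yields one record at a time.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # text after the leading '@' on line 1
    sequence: str  # nucleotide string on line 2
    quality: str   # ASCII-encoded base qualities on line 4


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four-line block of a FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header.strip())
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header.strip())
        yield FastqRecord(header[1:].strip(), sequence, quality)


# Two invented reads; the second repeats its name on the '+' line.
demo = io.StringIO(
    "@read_1\nACGTAC\n+\nIIII5+\n"
    "@read_2\nTTGCA\n+read_2\n;;;7;\n"
)
records = list(read_fastq(demo))
for record in records:
    print(record.name, len(record.sequence), record.quality)
```

The length check mirrors the point made in the text: every base on the
second line must have exactly one quality character on the fourth line.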
The quality of the bases across the read length is usually expressed as
a Phred score. The Phred score is nothing but ten times the negative
logarithm (base 10) of the probability that the base was called
incorrectly, that is, Q = -10 x log10(P). So if the base had a one in
hundred chance of an error, which means a 0.01 probability of error, the
Phred score would be 20 (as -10 x log10(0.01) = -10 x (-2) = 20). A Phred
score of 30 would mean a base error probability of one in a thousand, a
score of 20 a probability of one in a hundred, and so on. There are a
number of ways you could evaluate the quality of data. One approach is to
plot the distribution of qualities at every base position, and this plot
serves as a ready reference to see whether the sequencing was good or
not. The quality of sequences could be quite variable because of issues
in the library preparation or sequencing. If the reads have issues with
quality, or contain sequences of the adapters used for sequencing, they
are usually processed to remove the low quality bases and adapter
sequences, and this step is known as trimming. How to verify the quality
is detailed in the later chapter.
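The relation between the quality characters, the Phred score and the
error probability can be made concrete with a short sketch. It assumes
the common "Phred+33" encoding, in which the score is the ASCII code of
the character minus 33; older instruments used other offsets, so the
offset is left as a parameter. The function names are hypothetical.

```python
import math
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (assumed offset 33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(phred: float) -> float:
    """Probability that a base call is wrong: P = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


def phred_from_probability(p_error: float) -> float:
    """Phred score for a given error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p_error)


scores = phred_scores("I5+!")
print(scores)                                  # one score per quality character
print([error_probability(q) for q in scores])  # matching error probabilities
print(phred_from_probability(0.01))            # one-in-hundred error chance
```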
The next step is to align the good quality reads to the reference human
genome. The selection of the genome version or build is very important,
as there are non-trivial differences in the positions of nucleotides and
annotations of genes between the builds. A number of computational
algorithms have been used extensively in the literature to align the
reads. The purpose of alignment is to find the cognate position of the
read in the genome, which offers a way to compare whether each nucleotide
across the read is the same as or different from the reference. As you
would have rightly imagined, each genomic position corresponding to
protein coding exons would be covered by a number of reads. This is
denoted as coverage, or how many times the nucleotide is covered by
reads.
Once you have aligned the reads to the reference genome, you would find
some positions where the reads differ from the reference genome template.
This information could be analyzed computationally to derive which
positions have a variant. As you would have rightly guessed, a higher
coverage would provide you with better accuracy of the variants called.
If you imagine a homozygous variant, all or rather the majority of the
reads would have the particular variant, while in the case of a
heterozygous variation approximately half the reads would have the
particular change with respect to the reference template. This entire
process is called variant calling. As mentioned before, a number of
computational algorithms have been extensively used to accurately call
variations in the genome.
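The intuition above (nearly all reads carry the change for a homozygous
variant, about half for a heterozygous one) can be illustrated with a
deliberately naive sketch. This is not how production variant callers
work; they use statistical models that account for base and mapping
qualities. The function name, the allele-fraction cut-offs (0.2 and 0.8)
and the minimum depth of 10 reads are all illustrative assumptions.

```python
def naive_genotype(ref_reads: int, alt_reads: int, min_depth: int = 10) -> str:
    """Guess a genotype at one position from reference/variant read counts."""
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return "no-call"  # too few reads to say anything reliable
    alt_fraction = alt_reads / depth
    if alt_fraction >= 0.8:
        return "homozygous variant"
    if alt_fraction >= 0.2:
        return "heterozygous"
    return "homozygous reference"


for ref, alt in [(2, 98), (52, 48), (100, 1), (3, 2)]:
    print(ref, alt, naive_genotype(ref, alt))
```

The last example shows why coverage matters: with only five reads the
sketch refuses to make a call at all.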
Figure 3. The alignment of reads to the reference genome. The positions
where the bases in the reads are different from those of the reference
genome are highlighted.
The variations in the genome are usually available in a standard format
known as the VCF. VCF stands for Variant Call Format. A number of
analysis software packages are able to recognize this format and provide
annotations to the variants in terms of information that would help
clinch the diagnosis. The structure of the VCF file is summarized in
Figure 4.
Figure 4. The VCF file format representation of variations in the exome.
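For readers who prefer to see the structure as text, the sketch below
reads the eight fixed, tab-separated columns of a VCF data line (CHROM,
POS, ID, REF, ALT, QUAL, FILTER, INFO) and skips the header lines that
start with '#'. It is a simplified illustration, not a full parser: the
function names are hypothetical, per-sample genotype columns are ignored,
and the two variant lines are invented for the example.

```python
from typing import Dict, Iterable, Iterator


def parse_vcf_line(line: str) -> Dict:
    """Split one VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    info_dict = {}
    for item in info.split(";"):
        if item and item != ".":
            key, _, value = item.partition("=")
            info_dict[key] = value if value else True  # flags have no value
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles may be listed
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }


def read_vcf(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield parsed variants, skipping header ('#') and blank lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)


demo_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t12345\t.\tA\tG\t50.0\tPASS\tDP=100;AF=0.5\n",   # invented variant
    "2\t67890\t.\tC\tT,G\t.\tPASS\tDP=40;DB\n",          # invented variant
]
variants = list(read_vcf(demo_lines))
for variant in variants:
    print(variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
```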
Chapter 13
How good is the exome sequencing data?
There are three major parameters which decide whether the exome
sequencing data is good or not. The first is, of course, the quality of
the sequencing reads, the second is the coverage depth across the target
regions, and the third is the percentage of reads that align to the
genome.
The first parameter, the quality of reads, is possibly the easiest to
check. A number of tools, both online and offline, are available to check
the quality of bases. It should be noted that the distribution of base
qualities is as important as the mean quality of the bases. The scheme
below shows the base quality plot for a good set of sequencing reads. It
also shows the advantage of looking at the distribution of the qualities
rather than only the mean quality at each base position.
Figure 1. Quality plots for good quality reads.
Figure 2. Quality plots for bad quality reads. Note the low quality of
reads towards the end.
The second important parameter to check is the coverage depth across the
target region. On average, for identification of rare disease variants in
clinical settings, it is recommended to have at least 100x coverage worth
of high quality data. The number of reads required depends on the read
length and the total length of the exome capture (in the case of a whole
exome it is approximately 50 Mb). A 100x coverage of a 50 Mb target
amounts to 5 billion bases; for 100-base reads, this would mean about 50
million reads, and so on.
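The arithmetic behind the read count can be written out explicitly: the
number of bases needed is the target size multiplied by the desired
depth, and dividing by the read length gives the number of reads. For a
50 Mb target at 100x with 100-base reads this is 5,000,000,000 / 100 =
50,000,000 reads. The sketch below is illustrative; the function name is
hypothetical, and the optional on_target_fraction parameter is an added
assumption to reflect that not every sequenced read falls on the target.

```python
import math


def reads_required(target_bp: int, depth: int, read_length: int,
                   on_target_fraction: float = 1.0) -> int:
    """Approximate number of reads needed for a given mean target depth."""
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    total_bases = target_bp * depth
    return math.ceil(total_bases / (read_length * on_target_fraction))


# 50 Mb exome, 100x mean depth, 100-base reads, every read on target.
print(reads_required(50_000_000, 100, 100))
# Same target if only half of the reads were to land on the target.
print(reads_required(50_000_000, 100, 100, on_target_fraction=0.5))
```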
The third important parameter is the alignment percentage. It denotes the
percentage of the total reads which aligned to the reference genome. On
average, in a well-set experiment, more than 95 percent of the reads
generated after capture should align to the human genome. At times, with
good quality data, the alignment percentage could even cross 99 percent.
A low alignment percentage could mean that one of a number of things has
gone wrong. One of the major possibilities is contamination of the
reagents. Other possibilities include inefficient capture or sequencing.
Adapter contamination is one of the first things to look for in case the
reads show an abnormal alignment percentage. Adapter contamination could
also be identified in the FastQC report, which would show
over-represented sequences. Over-representation of particular sequences,
especially repeat sequences, would mean improper capture or library
preparation.
Apart from the total alignment percentage, the coverage of the target
sites is also an important consideration. For accurate variant calling,
it is advised to have good coverage across the majority of the target
sites. Skewed target coverage would mean an inefficient capture
procedure.
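The alignment percentage itself is a simple ratio of aligned reads to
total reads. The following sketch computes it and flags values below the
95 percent mark mentioned above; the function names and the example read
counts are hypothetical, and the flag is only a prompt to investigate
further, not a verdict on the data.

```python
def alignment_percentage(mapped_reads: int, total_reads: int) -> float:
    """Percentage of all sequenced reads that aligned to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


def alignment_flag(percentage: float, expected: float = 95.0) -> str:
    """Return 'ok' above the expected percentage, else 'investigate'."""
    return "ok" if percentage > expected else "investigate"


pct = alignment_percentage(48_500_000, 50_000_000)
print(pct, alignment_flag(pct))
```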
Chapter 14
Prioritizing, annotating and interpreting variants
As described in the previous chapter, the real analysis and
interpretation starts after you lay your hands upon the compendium of
variations called from the exome. Ideally, we expect all variants to be
in the standard VCF format, which makes them compatible and interoperable
with most tools and resources available online for exome analysis. But
before we go right into the thick of exome analysis, it is imperative to
understand conceptually how to prioritize variations. There are largely
six approaches to prioritize variations from exome or whole genome
sequencing data, and these are summarized in Figure 1. The highlighted
region denotes the exome sequenced and the panel below suggests the
approach to filter or prioritize variations.
Such prioritization strategies could be employed at any step, and the
selection of the approach depends on the specific case. If there are
multiple affected family members, as in the case of the Bhai, a
linkage-based strategy is useful. One could potentially sequence multiple
affected members of the same family and, if possible, unaffected members
too. A segregation-based strategy could then be used to retain all
variations shared by the affected individuals and to exclude those
present in the unaffected individuals. Such an approach would be
extremely useful in autosomal dominant diseases with multiple affected
family members.
Figure 1. Summary of popular approaches or strategies to prioritize
variants from exome sequencing.
The second scenario, as you would see, involves a consanguineous marriage,
where you would expect an autosomal recessive pattern of inheritance of
the disease-causing mutation. Here a homozygosity-based strategy, taking
into consideration all homozygous variants and prioritizing them through
standard pipelines, would be the best approach to follow.
The third scenario involves a non-consanguineous marriage with a probable
autosomal recessive pattern of inheritance, where filtering the exome by
exclusion for heterozygous variations could be the approach to follow. In
some cases where the affected child is not available for testing, as in
the case of abortions, sequencing both the parents for heterozygous
variations associated with Mendelian diseases would be the alternative to
follow.
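Two of the strategies described in this chapter reduce to simple set
operations once each person's variants are represented as a set. The
sketch below illustrates a segregation-based filter (shared by all
affected members, absent from all unaffected members) and a
homozygosity-based filter. The function names, the tuple representation
of a variant and the example variants are illustrative assumptions, not a
description of any particular tool.

```python
from typing import Dict, List, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def segregating_variants(affected: List[Set[Variant]],
                         unaffected: List[Set[Variant]]) -> Set[Variant]:
    """Variants carried by every affected member and by no unaffected member."""
    shared = set.intersection(*affected) if affected else set()
    for carrier in unaffected:
        shared -= carrier
    return shared


def homozygous_variants(genotypes: Dict[Variant, str]) -> Set[Variant]:
    """Variants with a homozygous alternate genotype ('1/1' or '1|1')."""
    return {v for v, gt in genotypes.items() if gt in ("1/1", "1|1")}


# Invented variants for a small family.
v1, v2, v3, v4 = ("1", 100, "A", "G"), ("2", 200, "C", "T"), \
                 ("3", 300, "G", "A"), ("4", 400, "T", "C")
affected_members = [{v1, v2, v3}, {v1, v2, v4}]
unaffected_members = [{v2}]
print(segregating_variants(affected_members, unaffected_members))
print(homozygous_variants({v1: "1/1", v2: "0/1", v3: "1|1"}))
```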
The further analysis of the exome data involves three major steps. The
first step involves identifying variations which cause a change in the
amino acid sequence of proteins and are predicted to be deleterious. The
second step involves annotating the genes with respect to the disease
candidates, and the third step involves prioritizing variations using
different strategies as appropriate to the specific case.
The first step is to find variations which could change the amino acid
sequence of the protein and are predicted to be deleterious. As you would
have imagined, not all variations in the exome are important or have a
functional consequence. The variations that cause a change in the amino
acid sequence are called non-synonymous variations, while the variations
that do not change the amino acid sequence are called synonymous
variations. Not all non-synonymous variations are important. Only a small
proportion of the non-synonymous variations in the exome change the amino
acid sequence of a protein so as to produce a functional effect. These
are variations that cause an amino acid change in regions of the protein
that are extremely important for its function or structure, and they are
generally called deleterious variations. Whether a variation could
potentially be deleterious or not is largely derived from computational
predictions based on the amino acid change caused by the specific
variation in question. Two computational tools are popularly used to
prioritize deleterious variations: SIFT and PolyPhen2. The two algorithms
use similar, but distinct, approaches to annotate variations as
deleterious or not.
SIFT stands for Sorting Intolerant from Tolerant and uses the
evolutionary conservation of the amino acid at the particular position in
the protein to predict whether the variation is deleterious or not. The
basic assumption is that if the amino acid at a particular position in
the protein is highly conserved through evolution, a change to a less
frequent amino acid at that position could be functionally deleterious
and would thereby have been evolutionarily discarded. PolyPhen2 is
another algorithm to prioritize variations. The algorithm is a bit more
complicated: apart from the conservation of the position, it also uses
the structural context of the amino acid, and additionally uses
artificial intelligence methodologies to predict whether the change is
deleterious in nature or not.
Neither approach on its own might be quite effective in prioritizing the
variations. So one approach that has been popularly employed by
researchers is to use a consensus of both to prioritize variations that
are deleterious in nature. You should however note that while a consensus
approach might be highly specific, such a stringent approach might
exclude some variations that are functionally relevant, and whether to
use the tools in consensus or alone has to be decided on a case-to-case
basis. The online applications that integrate these predictions are
discussed later in this chapter.
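The trade-off between using the two predictors in consensus or alone can
be sketched in a few lines. The cut-offs used here are assumptions for
illustration: a SIFT score at or below 0.05 is commonly read as
deleterious, and a PolyPhen2 score at or above 0.85 is used here as a
stand-in for "probably damaging"; the appropriate thresholds depend on
the tool version and should be checked against its documentation. The
function name and the mode labels are hypothetical.

```python
from typing import Optional

SIFT_DELETERIOUS_MAX = 0.05   # assumed cut-off: lower SIFT scores are worse
POLYPHEN_DAMAGING_MIN = 0.85  # assumed cut-off: higher PolyPhen2 scores are worse


def is_deleterious(sift: Optional[float], polyphen: Optional[float],
                   mode: str = "consensus") -> bool:
    """Combine two predictor scores; a missing score counts as 'not deleterious'."""
    sift_call = sift is not None and sift <= SIFT_DELETERIOUS_MAX
    polyphen_call = polyphen is not None and polyphen >= POLYPHEN_DAMAGING_MIN
    if mode == "consensus":
        return sift_call and polyphen_call  # stringent: both must agree
    if mode == "either":
        return sift_call or polyphen_call   # sensitive: one is enough
    raise ValueError("mode must be 'consensus' or 'either'")


print(is_deleterious(0.01, 0.99))                 # both predictors agree
print(is_deleterious(0.01, 0.10))                 # excluded by consensus
print(is_deleterious(0.01, 0.10, mode="either"))  # retained when used alone
```

The second and third calls show the point made in the text: a variant
flagged by only one predictor is dropped by the consensus rule and kept
by the more permissive one.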
The second step is to annotate the variations and
genes associated with the disease phenotypes in
question. As mentioned in the earlier chapter, complete
clinical details come in handy here. A number of tools
discussed later in this chapter can take in additional
annotation of the patient phenotypes to prioritize
variations.
Two web-based resources have been extensively
used by clinicians worldwide to clinically annotate
exomes, prioritize variations and possibly arrive at a
diagnosis: Exomiser, maintained by the Sanger Institute,
and PhenIX.
Exomiser has a web-based interface where you
could upload the VCF variant file corresponding to the
exome. If multiple individuals from a family have been
sequenced, you could also upload their variants together
with the associated pedigree information in a specified
format. The resource also features options where you
could input either the diagnosis of the patient or, if the
diagnosis is uncertain, a set of phenotypes. You could
additionally specify the following optional parameters:
1. Minimum variant call quality: You could specify a
Phred score, say 30.
2. Maximum minor allele frequency (%): This option
allows you to exclude common variations by allele
frequency. You could set a maximum allele frequency
of 1%.
3. Remove off-target, intronic and synonymous variants,
dbSNP variants and non-pathogenic variants: These
options allow you to exclude such variations from the
report.
4. Inheritance model: You could select the specific
model if you are sure about the inheritance of the
disease; this option is then used to prioritize the
variations. Otherwise, you could select none to display
all variants in the report.
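The effect of the first three options can be pictured with a small
Python sketch. This is not Exomiser's own code or interface: the
field names (qual, maf_percent, consequence), the consequence labels
and the sample records are hypothetical and chosen only for
illustration. A Phred-scaled quality of 30 corresponds to a 1 in
1000 chance that the call is wrong, which the helper function makes
explicit.

```python
# Consequence classes removed from the report in this sketch (assumption).
EXCLUDED_CONSEQUENCES = {"off_target", "intronic", "synonymous"}


def phred_to_error_probability(qual):
    """Convert a Phred-scaled quality to the probability of a wrong call."""
    return 10 ** (-qual / 10)


def passes_filters(variant, min_qual=30.0, max_maf_percent=1.0):
    """Return True if a variant survives the quality, frequency and
    consequence filters."""
    if variant["qual"] < min_qual:
        return False
    if variant["maf_percent"] > max_maf_percent:
        return False
    return variant["consequence"] not in EXCLUDED_CONSEQUENCES


# Hypothetical variants; real values would come from the annotated VCF file.
variants = [
    {"id": "var1", "qual": 99.0, "maf_percent": 0.1, "consequence": "missense"},
    {"id": "var2", "qual": 12.0, "maf_percent": 0.1, "consequence": "missense"},
    {"id": "var3", "qual": 80.0, "maf_percent": 5.0, "consequence": "missense"},
    {"id": "var4", "qual": 80.0, "maf_percent": 0.2, "consequence": "synonymous"},
]

kept = [v["id"] for v in variants if passes_filters(v)]
print(kept)  # ['var1']
```

Only var1 survives: var2 fails the quality cut-off, var3 is too
common and var4 is synonymous.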
Figure 2. Screenshot of Exomiser with the different options.
Another similar resource that allows you to prioritize
variations is PhenIX, maintained by the Charité in Berlin.
Figure 3. Screenshot of PhenIX with the different options.
PhenIX has an interface quite similar to that of Exomiser
and has an option where the user can input the
phenotypes or traits using the autofill option, upload the
VCF file and specify the inheritance model and the
maximum allele frequency.
Both tools prioritize the variations by
pathogenicity or deleterious effect of the variation(s)
and by similarity of the genes harboring these variations
to the genes associated with phenotypes provided by
the user.
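The idea of ranking by two lines of evidence can be sketched in a
few lines of Python. The simple average used below is an assumption
made for illustration and is not the scoring scheme of either tool;
the gene names and scores are likewise hypothetical.

```python
def combined_score(candidate):
    """Average the variant (pathogenicity) score and the phenotype
    similarity score, both assumed to lie between 0 and 1."""
    return (candidate["variant_score"] + candidate["phenotype_score"]) / 2


# Hypothetical candidate genes with made-up scores.
candidates = [
    {"gene": "GENE_A", "variant_score": 0.95, "phenotype_score": 0.20},
    {"gene": "GENE_B", "variant_score": 0.80, "phenotype_score": 0.90},
    {"gene": "GENE_C", "variant_score": 0.30, "phenotype_score": 0.95},
]

# Highest combined score first.
ranked = sorted(candidates, key=combined_score, reverse=True)
ranking = [c["gene"] for c in ranked]
print(ranking)  # ['GENE_B', 'GENE_C', 'GENE_A']
```

GENE_A carries the most damaging variant but matches the phenotype
poorly, so it falls below GENE_B, which scores well on both counts.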
Apart from the deleteriousness of the variation,
another parameter that would help clinch a diagnosis
is the allele frequency of the variation in populations.
Most deleterious variations are expected to be quite
rare in the population, so an allele frequency of less
than 1 per 100 (1%) is a reasonable threshold to use
when prioritizing variations. In many cases, the
variation may also turn out to be novel, with no allele
frequency information available.
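As a worked example of the arithmetic, the Python sketch below
computes an allele frequency from allele counts and applies the 1%
threshold. The counts are hypothetical, and the decision to keep
novel variants (those with no frequency data) as candidates is an
assumption of this sketch rather than a fixed rule.

```python
def allele_frequency(alt_count, total_alleles):
    """Return alt_count / total_alleles, or None when no alleles
    were observed (for example, a novel variant)."""
    if not total_alleles:
        return None
    return alt_count / total_alleles


def is_rare(af, threshold=0.01):
    """Treat variants below the threshold, or without any frequency
    data, as rare candidates."""
    return af is None or af < threshold


# 2,500 diploid individuals contribute 5,000 alleles.
rare_af = allele_frequency(3, 5000)       # 0.0006, i.e. 0.06%
common_af = allele_frequency(150, 5000)   # 0.03, i.e. 3%
novel_af = allele_frequency(0, 0)         # no data available

print(is_rare(rare_af), is_rare(common_af), is_rare(novel_af))
# True False True
```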
Chapter 15
Don't forget the validation

The validation of the findings from whole-exome
sequencing is as important as the exome sequencing
itself. Most researchers and clinicians are not aware
that exome sequencing and analysis have their own
limitations. It is therefore necessary to independently
validate the variation before confirming the diagnosis.
There are three scenarios where validation is to be
considered. In the first scenario, the variant is known
and has previously been implicated in the disease. Here
the validation is quite simple, in the sense that the
finding needs to be verified independently in the sample
or samples. The traditional Sanger sequencing approach
is commonly used in the field, especially for single
nucleotide variations. Polymerase chain reaction primers
could be designed around the variant in question, and
the region could be amplified and sequenced to confirm
the diagnosis.
The second scenario is where you have identified
a new variant in a known gene. The first line of evidence
that would clinch the variant is its segregation in the
affected members of the family, together with a
predicted deleterious effect. Full participation of the
members of the family is required in such cases, and
consent has to be obtained (detailed in the ethical
considerations section of this book). In some cases,
participation of other family members would be
impossible to obtain due to privacy and anonymity
concerns. Another approach to validate a genetic variant
is to check its segregation in a trio. In some cases,
especially sporadic cases and in specific social
circumstances, it might not be possible to approach
other family members, or sometimes even the parents,
but the pathogenicity nevertheless needs to be proven
unequivocally. In such circumstances, a number of
advanced methods have been adopted in the literature.
These include validation of the finding using specific
assays at the protein level or at the cellular level using
advanced gene cloning, expression and sometimes
genetic engineering approaches. These technologies are
specialized applications, mostly in the research domain,
and are clearly outside the purview of this book.
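The segregation check mentioned at the start of this scenario can be
sketched in Python as follows. The sketch assumes a simple, fully
penetrant dominant model, in which every affected member carries the
variant and no unaffected member does; other inheritance models need
different rules. The genotype labels and the family records are
hypothetical.

```python
def segregates_dominant(family):
    """Return True if carrier status matches affection status for
    every family member (fully penetrant dominant model)."""
    carrier_genotypes = {"het", "hom_alt"}
    return all(
        member["affected"] == (member["genotype"] in carrier_genotypes)
        for member in family
    )


# Hypothetical trio: affected father and child carry the variant.
trio = [
    {"role": "father", "affected": True, "genotype": "het"},
    {"role": "mother", "affected": False, "genotype": "hom_ref"},
    {"role": "child", "affected": True, "genotype": "het"},
]

# Adding an unaffected sibling who carries the variant breaks segregation.
extended = trio + [{"role": "sibling", "affected": False, "genotype": "het"}]

print(segregates_dominant(trio))      # True
print(segregates_dominant(extended))  # False
```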
The third scenario is where you stumble upon a
new gene and variant that causes a disease. While
segregation and/or homozygosity mapping in cases of
consanguinity, and filtering based on allele frequencies,
could clinch a conclusive diagnosis, many cases still
leave a margin of error or doubt about the diagnosis and
the implications of the genes involved. Functional
validation of such new genes is presently the realm of
research laboratories, as no clear-cut and comprehensive
methodologies exist to systematically validate the
functional effects. Apart from the popular cell culture
systems, a number of research laboratories employ
model organisms to functionally validate the gene and
model the disease process. Model organisms are useful
to validate physiological processes, especially in cases
of developmental defects or structural abnormalities,
which would be difficult to validate in cell culture
systems. Nevertheless, cell culture systems are useful to
validate specific processes, including metabolic pathways
and genes involved in specific processes at the cellular
level. The popular model organisms used to validate
disease genes include vertebrate and invertebrate
systems such as mouse, rat, zebrafish, fly and worm. Our
group employs zebrafish, which is a popular vertebrate
model organism, for functionally validating novel
genes.
Chapter 16
Ethical considerations in whole exome sequencing
There are a number of ethical considerations that
have to be accounted for while performing and
analyzing exome sequencing in clinical settings. This
is primarily because exome sequencing is unique in many
ways compared to traditional diagnostic approaches.
For example, the line between diagnostics and research
is quite blurred in the case of exome sequencing, because
unlike other diagnostic approaches, methodologies for
exome testing and validation are still not well
established. In addition, since most clinicians would use
exome sequencing for understanding rare diseases, the
diagnostic accuracy in many cases cannot be established
due to the paucity of numbers and the unique nature of
each patient. It should also be kept in mind that
investigations in genetics have to be based on the
strong principles of beneficence, reciprocity, justice and
professional responsibility.
Three major areas are covered in the following
sections of this chapter: educating and informing the
patient, informed consent and the handling of
incidental findings, and the anonymity and privacy of
the patient and family members.
Information and education
Educating the patient on the technology, the analysis
process and the interpretation is an important component.
The patient needs to be educated about the possible
pitfalls, fallacies and limitations of exome sequencing. In
addition, the patient needs to be informed about
incidental findings, which could have clinical, social
and emotional implications, and should be equipped
to make an informed decision about them. The patient
also needs to be informed that genetic testing of this
sort could reveal information not just about the patient
or family, but also information that might be critical
and relevant to other relatives in the family and
possibly the next generation. The pros and cons of
such information being available, and its implications,
also need to be addressed.
Incidental findings and reporting
Exome sequencing is unique compared to
traditional research or diagnostic tests, where the data
generated is commensurate with the questions asked
or, rather, the chance of finding something incidental
while performing a test is meager. The first set of
diagnostics that started changing this paradigm was
radiology, where whole body scans started churning out
more information than was actually required to answer
the clinical questions. The more data that is generated
in a generic form, the more incidental findings start to
appear.
Exome sequencing is unique in this respect, in that
the sequencing allows a comprehensive scan of all
variants in protein coding regions. Apart from the
variant or variants that help in the diagnosis, this
would include other variants, many of which would have
clinical implications or relevance. Many of the resulting
findings may not have direct implications for the
condition at hand, but might have long-term
implications. One example could be variants that are
associated with drug metabolism or adverse drug
reactions. In some situations, the information might
have implications for early diagnosis or prognosis, as in
the case of inherited cancers. In many cases, a clear
distinction between an incidental finding and the target
mutation under study does not exist either.
The traditional approach to such incidental
findings in clinical settings has been one of 'didn't look,
didn't find, don't report', where the onus was on the
doctor to decide what needs to be looked for in the
results and to report what he or she felt was good or
relevant for the patient. This paradigm might not always
be the right approach to follow, because the incidental
findings by themselves could be of immense value to the
patient, and possibly to another doctor treating the
patient, as in the case of pharmacogenetic variants,
which might help in adjusting the dosage of specific
drugs.
In addition, exome sequencing is unique
compared to computed tomography (CT) scans in
another way. While CT scans could reveal incidental
findings in addition to the intended evidence, the
relevance of those findings rarely changes with time.
In the case of whole genome or exome sequencing, the
field is still nascent, and researchers are discovering new
variants and attributing clinical relevance to them
almost every day. Reanalyzing the exome sequencing
data at a later point in time could therefore reveal new
findings of clinical relevance. This unique situation
poses another interesting paradigm, in which reporting
of the exome is a dynamic process rather than an end
point or static process, in contrast to many traditional
clinical diagnostic approaches.
The American College of Medical Genetics and
Genomics (ACMG) formed a working group to
deliberate on guidelines for reporting incidental findings
in exome and genome sequencing, which were published
recently. The working group recommended the
reporting of incidental findings for a set of specified
disorders, variants and classes of variants by evidence.
This reporting is done irrespective of the primary
indication for exome sequencing.
American College of Medical Genetics and Genomics
Recommendations for Reporting Incidental Findings in
Clinical Exome and Genome Sequencing
A comprehensive description of the methodology,
recommendations, list of genes, variants and phenotypes is
available in the document entitled ACMG
Recommendations for Reporting of Incidental Findings in
Clinical Exome and Genome Sequencing
accessible at URL:
https://www.acmg.net/docs/ACMG_Releases_HighlyAnticipated_Recommendations_on_Incidental_Findings_in_Clinical_Exome_and_Genome_Sequencing.pdf
Apart from the incidental findings, the patient or
family members may decide to mask reporting on
variations in specific regions or loci that might have
non-trivial implications. The consent should include a
section where the patient or family members could
explicitly state this.
Anonymity and privacy
Utmost care regarding anonymity and privacy is
another important component of ethical conduct
towards the patient and family. It should be emphasized
that anonymity and privacy are not two sides of the same
coin, but are separate entities. A detailed discussion with
the patient and family members is essential on this
aspect. In many cases, the impact of the genetic testing is
not limited to the index case or family, but might have
implications for genetic predisposition and disease
manifestation in other family members too. Similarly,
the identification of a mutation might not be relevant to
the specific individual or family, but could be of
relevance in terms of screening and carrier detection in
other members of the family. As in the case of Bhai, the
identification of a novel mutation in the KRT5 gene
would have implications for genetic screening, and in
some cases prenatal screening, for the other members of
the family. In some cases, the validation of the genetic
variant would require the participation of other members
of the family, including people who might not be
affected by the disease.
With the advent of Internet support groups and
patient groups, in many cases the patient or the family
members do not wish to remain anonymous, since
this might benefit the larger community and society.
In other cases, the patient and family would like to
remain anonymous, given the social stigma associated
with the disease and the social implications for other
members of the family. It is therefore the educated
decision of the patient or family that needs to be given
utmost importance. Questions in this direction need to
be non-suggestive, and should take into consideration
the social and emotional attachments and long-term
implications.
The last word

Exome sequencing is only a means, not an end. It
seemingly has a limited lifetime, being popular and
widely adopted largely due to its cost advantage and
ease of analysis and interpretation. With dwindling costs
and improved throughput of sequencing, it is inevitable,
not just plausible, that whole genome sequencing will
become the mainstay in the diagnosis of genetic diseases.
About the authors

Sridhar Sivasubbu
Scientist,
CSIR Institute of Genomics and
Integrative Biology (CSIR-IGIB)
Web: http://sridhar.rnabiology.org
Email: s.sivasubbu@igib.res.in
Sridhar Sivasubbu's laboratory is interested in exploiting the
advantages of zebrafish to dissect molecular mechanisms of gene
function, regulation and genome organization in vertebrates.
Research activities in his lab include deciphering non-coding RNA
mediated regulation of blood and blood vessel development and
development of zebrafish models for application in personalized
and precision medicine in humans. His group is actively involved in
mapping the genome and transcriptome of the wild zebrafish. His
group was also responsible for the whole genome sequencing of
human samples from India and other Southeast Asian countries.
Sridhar obtained his PhD from M.S. University, Tirunelveli, India and
did his postdoctoral research at the Center for Cellular and Molecular
Biology, India and the University of Minnesota, USA. He has been a
faculty member at the CSIR-Institute of Genomics & Integrative
Biology (CSIR-IGIB) since 2006. Sridhar also served as the CEO of
The Center for Genomic Application, a Public-Private partnership
company established by CSIR-IGIB for enabling research in the field
of Genomics and Proteomics, where he spearheaded the application
of next generation sequencing technology for commercial projects.
About the authors

Vinod Scaria
Scientist,
CSIR Institute of Genomics and
Integrative Biology (CSIR-IGIB)
Web: http://vinodscaria.rnabiology.org
Email: vinods@igib.in
Vinod Scaria is a clinician turned computational biologist. His
laboratory is interested in understanding the function, organization
and regulation of vertebrate genomes, and how genomic variations
could potentially impact them. He is also involved in creating novel
methods and resources for analysis and annotation of genomes and
understanding the functional impact of genomic variations. He has
been part of collaborative genomics projects aimed at
understanding the Asian Genome diversity. He has also been part of
whole genome sequencing and analysis projects including the
Indian, Sri Lankan and Malaysian genome projects and is also a
member of the HUGO Pan-Asian Population Genomics Initiative
task-force. He has adopted novel and creative strategies, such as
the use of social media and the participation of a large number of
undergraduate students in collaborative projects, to accelerate
genome annotation and the co-creation of resources for genome
annotation.
Vinod did his undergraduate medical education at Calicut
Medical College, University of Calicut and his PhD in computational
biology at the University of Pune. Vinod has over 80 publications
in international peer-reviewed journals and two book chapters to
his credit. He is also on the editorial boards of PLoS ONE,
PeerJ, Journal of Translational Medicine and Journal of
Orthopaedics (Elsevier). He is also a recipient of the CSIR Young
Scientist Award for Biological Sciences in 2012. He was a member of
the senate of the Academy of Scientific and Innovative Research
(AcSIR).
Reaching the authors

This book was written keeping in mind how genomic technologies
could translate to patient-care. The authors would be happy to extend
their expertise and resources to help the diagnosis of patients with
rare genetic diseases. Interested clinicians and patient groups may
kindly contact us for further discussion.
You could reach us at:

Email: sridhar@igib.in OR vinods@igib.in
Register yourself to the Clinical Exome Group

We have set up a readers' club to keep you updated about
the new versions of this book and recent developments in the field.
It would also be an opportunity to share the issues you face with
exome sequencing and analysis, find answers to them, and
discuss interesting cases with experts in the field.
To register, follow this link: http://goo.gl/o9aAfC
You could also leave your comments on our Facebook page:
What readers have to say......

"The book is very well written, concise and provides an excellent
collection of data capturing the transition of one era into another. Due
emphasis was given towards the limitations of NGS along with its
widely acknowledged benefits. It helps one to understand the basics
of whole exome sequencing from a realistic viewpoint. Each chapter is
well constructed and systematically elucidates situations where WES
would be useful. Moreover, it provides an impetus for the clinicians to
understand their contributions towards accurate phenotyping for
better understanding of the genetic variations in a diagnostic set-up"
Yenamandra Vamsi Krishna,
Department of Dermatology,
All India Institute of Medical Sciences, Delhi
Let us know what you have to say about this book on our
Facebook page: