Epinomics-Cassandra Summit Submit

Anupama Joshi/Matt Negulescu
Cassandra/Spark solutions for Genomic big data Analysis and Visualization
Introduction
Anupama Joshi - Technology Infrastructure and Execution at Epinomics
ajoshi@epinomics.co
http://www.linkedin.com/in/anupamajoshi
Matt Negulescu - Product Requirements and User Interaction
mnegulescu@epinomics.co
DataStax, All Rights Reserved.
1. Introduction
2. What is Epigenomics?
3. Genomic Data and Epigenomic Data
4. Why Cassandra?
5. Demo
Epinomics
A platform that drives personalized medicine by leveraging big data analytics and proprietary epigenomics technology.
What is Epigenomics?
The study of modifications that turn genes on or off without affecting the DNA sequence.
Genomics
DNA is the hardware of the body:
static and descriptive (i.e. nature).
Epigenomics
Epigenome is the software layer:
dynamically turns genes on or off (i.e.
nature and nurture).
Typical Genomic data

Typical genomic sequencing data contains the nucleotide letters A, T, C and G.
Most research work focuses on variation from standard genome sequences.
From: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"
Epigenomic Data
Peaks Data
Regions of the genome where DNA was accessible
during the experiment.
chr1  713701  714600  peak.1  899  +
chr1  804976  805650  peak.2  674  +
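The two peak rows above can be read into typed records with a few lines of Python. This is a minimal sketch, not the Epinomics loader: the field names (`chrom`, `start`, `end`, `name`, `length`, `strand`) are assumptions, since the slide shows no column headers. The name `length` is chosen because the fifth value equals end minus start in both rows.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    # Field names are assumptions; the slide lists values without headers.
    chrom: str
    start: int
    end: int
    name: str
    length: int
    strand: str


def parse_peak(line: str) -> Peak:
    """Split one whitespace-separated peak row into a typed record."""
    chrom, start, end, name, length, strand = line.split()
    return Peak(chrom, int(start), int(end), name, int(length), strand)


rows = [
    "chr1 713701 714600 peak.1 899 +",
    "chr1 804976 805650 peak.2 674 +",
]
peaks = [parse_peak(row) for row in rows]

# In both rows the fifth column matches the width of the region.
widths = [p.end - p.start for p in peaks]
```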
Footprinting Data (Transcription factors)

A signal indicating a protein (i.e. a transcription factor)
binding to the DNA.
Why Cassandra?
High performance
Fault Tolerant
Linear scalability
Dynamic columns
Structured and unstructured data
Flexible data model
Real time querying
Epinomics Cluster
6 nodes
~600 GB and growing
2 datacenters
Epinomics Pipeline
Start with Sequencing data
Find peaks
Find footprints
Do differential analyses
Apply machine learning
Visualize results
Epinomics ETL
A picture is worth a thousand words
Visual inspection of model components is useful for interpretation
From: http://undsci.berkeley.edu/article/0_0_0/howscienceworks_09
TF Analysis
Footprint Detection and Storage

Footprint Detection
Identify genomic binding sites of transcription factors (TFs) at particular genomic locations.
533 transcription factors/sample x 200k rows
chromosome  start      end        length  strand  pwm       purity    IsBound
chr10       100001379  100001390  501             10.95492  -0.96717  FALSE
chr10       100010611  100010622  500     +       11.32268  -0.86117  FALSE
Retrieve on various attributes and region identifier (transcription factor name)

SELECT * FROM tf_purity_piq_new
WHERE sample_id = 2225 AND tf_name = 'CTCF.known1'
  AND purity >= 0.7 AND purity <= 0.9;
Footprint Detection and Storage

Retrieve data from Cassandra and process using Spark
to calculate the signal strength of each TF in the
sample.
Store the signal data in Cassandra to draw online
visualizations.
Peaks Processing and Storage

Each sample has between 150K and 200K peaks.
A typical biological experiment has between 10 and 200 samples.
Overlapping peaks are consolidated and processed using Spark GraphX.
Source: http://bedtools.readthedocs.io/
A typical experiment has between 300K and 600K overlapping peaks, depending on dataset and sequencing depth.
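To illustrate what consolidating overlapping peaks means, here is a minimal single-machine sketch in plain Python. It is not the Spark GraphX job described on the slide. It is a simple sort-and-sweep merge, and the function name `merge_peaks` and the second input interval are hypothetical.

```python
from collections import defaultdict


def merge_peaks(peaks):
    """Merge overlapping (chrom, start, end) intervals per chromosome.

    Intervals that overlap or touch (next start <= current end) are
    collapsed into one interval spanning both.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current = None
        for start, end in sorted(by_chrom[chrom]):
            if current is None:
                current = [start, end]
            elif start <= current[1]:
                # Overlaps the running interval: extend it.
                current[1] = max(current[1], end)
            else:
                merged.append((chrom, current[0], current[1]))
                current = [start, end]
        if current is not None:
            merged.append((chrom, current[0], current[1]))
    return merged


# Peaks pooled from several samples; the middle interval is made up
# so that it overlaps the first one.
pooled = [
    ("chr1", 713701, 714600),
    ("chr1", 714500, 714900),  # hypothetical peak from another sample
    ("chr1", 804976, 805650),
]
consolidated = merge_peaks(pooled)
```

In a graph formulation, each peak would be a vertex and each overlap an edge, so consolidated regions correspond to connected components. The sweep above yields the same grouping for intervals on a line.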
Differential Peaks Data

Use machine learning to identify regions showing significant differences between two sets of data (i.e. peaks data): approximately 100K to 200K peaks.
CREATE TABLE IF NOT EXISTS project_norm_values_diff (
    project_id int,
    peak_window text,
    pvalue double,
    sample_id_value_map map<int, double>,
    PRIMARY KEY (project_id, pvalue, peak_window)
);

SELECT * FROM project_norm_values_diff WHERE project_id = 333 AND pvalue > 0.9 LIMIT 100;
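The range filter on pvalue works because pvalue is the first clustering column: within one project_id partition, Cassandra keeps rows sorted by (pvalue, peak_window), so the query is a contiguous slice. The Python sketch below only simulates that ordering in memory and does not talk to Cassandra. The peak_window strings and p-values are hypothetical.

```python
from bisect import bisect_right

# One simulated partition (project_id = 333). Rows are kept sorted by
# the clustering columns (pvalue, peak_window), as in the PRIMARY KEY.
partition = sorted([
    (0.95, "chr1:804976-805650"),  # hypothetical values throughout
    (0.12, "chr1:713701-714600"),
    (0.91, "chr2:1000-1500"),
    (0.90, "chr3:200-900"),
])


def pvalue_greater_than(rows, threshold, limit):
    """Return up to `limit` rows with pvalue > threshold, in clustering order."""
    pvalues = [row[0] for row in rows]
    first = bisect_right(pvalues, threshold)  # first row strictly above threshold
    return rows[first:first + limit]


result = pvalue_greater_than(partition, 0.9, 100)
```

Because the matching rows are adjacent in the sorted partition, the lookup is a seek plus a sequential read rather than a scan of every row.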
Differential Peaks Analysis

Differential peaks are further grouped by k-means clustering using Spark MLlib.
Clustered data is stored in Cassandra.
CREATE TABLE IF NOT EXISTS difpeak_sample_clusterinfo (
    project_id int,
    kvalue int,
    cluster_location int,
    sample_id int,
    avg_peakvalue double,
    num_peaks_in_cluster int,
    PRIMARY KEY (project_id, kvalue, sample_id, cluster_location)
) WITH CLUSTERING ORDER BY (kvalue ASC, sample_id ASC);
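As a rough illustration of the clustering step, the sketch below runs a small k-means (Lloyd's algorithm) in plain Python and then counts peaks per cluster, the kind of summary the table above stores. It stands in for the Spark MLlib job and is not that job. The input vectors, the choice of the first k points as initial centroids, and all function names are assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; initial centroids are the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k), key=lambda c: squared_distance(p, centroids[c])
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Hypothetical differential peaks: one normalized value per sample.
peak_values = [(1.0, 1.2), (0.9, 1.1), (5.0, 5.2), (5.1, 4.9)]
centroids, assignment = kmeans(peak_values, k=2)

# Per-cluster peak counts, analogous to num_peaks_in_cluster.
num_peaks_in_cluster = {
    c: assignment.count(c) for c in sorted(set(assignment))
}
```

Each centroid coordinate is the average peak value of one sample within that cluster, which is roughly what an avg_peakvalue column keyed by sample_id and cluster_location would hold.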
More machine learning and analysis

1. Dimensionality Reduction (Principal Component Analysis)
rows = Projects x Samples x Samples

 project_id | sample_id | pc_name | pc_value
------------+-----------+---------+-----------
        237 |      4430 |       0 | -0.179271
        237 |      4430 |       1 |  0.340772
        237 |      4430 |       2 |  0.308466
Correlation Analysis
Pearson correlation of peaks among all samples.
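For reference, the Pearson coefficient between two samples can be computed directly from their peak values. The sketch below is a plain-Python illustration, not the production Spark code. Sample IDs 4431 and 4432 and all peak values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical peak values per sample, aligned on the same peak windows.
samples = {
    4430: [1.0, 2.0, 3.0, 4.0],
    4431: [2.0, 4.0, 6.0, 8.0],
    4432: [4.0, 3.0, 2.0, 1.0],
}

# Pairwise correlation for every ordered pair of samples.
correlation = {
    (a, b): pearson(samples[a], samples[b])
    for a in samples
    for b in samples
}
```

The resulting sample-by-sample matrix is symmetric with ones on the diagonal, which makes it a natural fit for a heat-map style visualization.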
Reproducibility Analysis
Shows variance and similarity among replicates
