Introduction
Anupama Joshi - Technology Infrastructure and Execution at Epinomics
ajoshi@epinomics.co
http://www.linkedin.com/in/anupamajoshi
Matt Negulescu - Product Requirements and User
Interaction
mnegulescu@epinomics.co
1. Introduction
2. What is Epigenomics?
3. Genomic Data and Epigenomic Data
4. Why Cassandra?
5. Demo
Epinomics
What is Epigenomics?
The study of modifications that turn genes on or off without affecting the
DNA sequence.
Genomics
DNA is the hardware of the body:
static and descriptive (i.e. nature).
Epigenomics
Epigenome is the software layer:
dynamically turns genes on or off (i.e.
nature and nurture).
Epigenomic Data
Peaks Data
Regions of the genome where DNA was accessible
during the experiment.
chr1  713701  714600  peak.1  899  +
chr1  804976  805650  peak.2  674  +
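A minimal sketch of how the two peak rows above could be parsed. The field names (`chrom`, `start`, `end`, `name`, `length`, `strand`) are assumptions made for illustration; the slide does not label these columns. In both sample rows the fifth value equals end minus start, which is why it is treated as a length here.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    # Field names are hypothetical; the slide does not label the columns.
    chrom: str
    start: int
    end: int
    name: str
    length: int
    strand: str


def parse_peak(line: str) -> Peak:
    """Parse one whitespace-separated peak row into a Peak record."""
    chrom, start, end, name, length, strand = line.split()
    return Peak(chrom, int(start), int(end), name, int(length), strand)


# The two rows shown on the slide.
rows = [
    "chr1 713701 714600 peak.1 899 +",
    "chr1 804976 805650 peak.2 674 +",
]
peaks = [parse_peak(row) for row in rows]

# Sanity check of the assumption that the fifth column is end - start.
for peak in peaks:
    assert peak.end - peak.start == peak.length
```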
Why Cassandra?
High performance
Fault Tolerant
Linear scalability
Dynamic columns
Structured and unstructured data
Flexible data model
Real time querying
Epinomics Cluster
6 nodes
600 GB and growing
2 datacenters
Epinomics Pipeline
Start with Sequencing data
Find peaks
Find footprints
Do differential analyses
Apply machine learning
Visualize results
Epinomics ETL
Source: http://undsci.berkeley.edu/article/0_0_0/howscienceworks_09
DataStax, All Rights Reserved.
TF Analysis
                  end        length  strand  pwm       purity    IsBound
chr10  100001379  100001390  501     +       10.95492  -0.96717  FALSE
chr10  100010611  100010622  500             11.32268  -0.86117  FALSE
Processed using Spark GraphX
Source: http://bedtools.readthedocs.io/
A typical experiment has between 300K and 600K overlapping peaks, depending on dataset and
sequencing depth.
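To make the idea of overlapping peaks concrete, here is a small single-machine sketch that collapses overlapping or touching intervals on the same chromosome, similar in spirit to an interval merge. It is plain Python and not the Spark GraphX processing mentioned above; the function name `merge_peaks` and the example coordinates are invented for illustration.

```python
def merge_peaks(peaks):
    """Merge overlapping or touching (chrom, start, end) intervals.

    Intervals are sorted by chromosome name and start position, then
    swept once; an interval is folded into the previous one when it is
    on the same chromosome and starts at or before the previous end.
    """
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical input: two overlapping peaks and two isolated ones.
example = [
    ("chr1", 150, 300),
    ("chr1", 100, 200),
    ("chr1", 400, 500),
    ("chr2", 100, 200),
]
result = merge_peaks(example)
```

With 300K to 600K peaks per experiment, a single sorted sweep like this is linear after sorting, but it does not distribute across machines, which is presumably one reason a Spark-based approach is used instead.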
kvalue int,
cluster_location int,
sample_id int,
avg_peakvalue double,
num_peaks_in_cluster int,
PRIMARY KEY (project_id, kvalue, sample_id, cluster_location)
) WITH CLUSTERING ORDER BY (kvalue ASC, sample_id ASC);
4430 | 0 | -0.179271
237 | 4430 | 1 | 0.340772
237 | 4430 | 2 | 0.308466
Correlation Analysis
Pearson correlation of peaks among all samples.
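As an illustration of the statistic named above, the following sketch computes the Pearson correlation for every pair of samples. It assumes each sample is a list of peak values aligned on the same peaks; the sample names and values are made up, and `pearson` and `correlation_matrix` are hypothetical helper names, not part of the Epinomics pipeline.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and the two deviation norms.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


def correlation_matrix(samples):
    """Map every ordered pair of sample names to its Pearson correlation."""
    names = sorted(samples)
    return {(a, b): pearson(samples[a], samples[b]) for a in names for b in names}


# Hypothetical peak values for three samples over the same four peaks.
samples = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [2.0, 4.0, 6.0, 8.0],
    "s3": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(samples)
```

In this toy input `s2` is a scaled copy of `s1`, so the pair correlates perfectly, while `s3` runs in the opposite direction.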
Reproducibility Analysis
Shows variance and similarity among replicates