
Anupama Joshi/Matt Negulescu

Cassandra/Spark solutions for Genomic big data Analysis and Visualization

Introduction
Anupama Joshi - Technology Infrastructure and Execution at Epinomics
ajoshi@epinomics.co
http://www.linkedin.com/in/anupamajoshi
Matt Negulescu - Product Requirements and User Interaction
mnegulescu@epinomics.co

DataStax, All Rights Reserved.


1. Introduction
2. What is Epigenomics?
3. Genomic Data and Epigenomic Data
4. Why Cassandra?
5. Demo


Epinomics

A platform that drives personalized medicine by leveraging big data
analytics and proprietary epigenomics technology.


What is Epigenomics?
The study of modifications that turn genes on or off without affecting the
DNA sequence.

Genomics
DNA is the hardware of the body:
static and descriptive (i.e. nature).


Epigenomics
The epigenome is the software layer:
it dynamically turns genes on or off (i.e. nature and nurture).

Typical Genomic data


Typical genomic sequencing data contains the nucleotide letters A, T, C, and G.
Most research work focuses on variation from standard genome sequences.

From: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"


Epigenomic Data
Peaks Data
Regions of the genome where DNA was accessible
during the experiment.
chr1    713701    714600    peak.1    899    +
chr1    804976    805650    peak.2    674    +
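
As a minimal sketch, the two peak rows can be parsed as whitespace-separated records. The slide shows only values, so the column names used here (chromosome, start, end, name, length, strand) are an assumption, and `Peak` and `parse_peak` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    length: int  # assumed meaning of the fifth column
    strand: str


def parse_peak(line: str) -> Peak:
    """Parse one whitespace-separated peak row into a Peak."""
    chrom, start, end, name, length, strand = line.split()
    return Peak(chrom, int(start), int(end), name, int(length), strand)


rows = [
    "chr1 713701 714600 peak.1 899 +",
    "chr1 804976 805650 peak.2 674 +",
]
peaks = [parse_peak(r) for r in rows]

# In both sample rows the fifth column equals end - start,
# which is why it is treated as a length here.
for p in peaks:
    print(p.name, p.end - p.start, p.length)
```

Keeping coordinates as integers makes later interval operations (overlap checks, merging) straightforward.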

Footprinting Data (Transcription factors)


A signal indicating a protein (i.e. a transcription factor)
binding to the DNA.


Why Cassandra?
High performance
Fault Tolerant
Linear scalability
Dynamic columns
Structured and unstructured data
Flexible data model
Real time querying


Epinomics Cluster
6 nodes
~600 GB and growing
2 datacenters


Epinomics Pipeline
Start with Sequencing data
Find peaks
Find footprints
Do differential analyses
Apply machine learning
Visualize results


Epinomics ETL


A picture is worth a thousand words
Visual inspection of model components is useful for interpretation

From: http://undsci.berkeley.edu/article/0_0_0/howscienceworks_09

TF Analysis


Footprint Detection and Storage


Footprint Detection
Identify genomic binding sites of transcription factors (TFs) at particular genomic locations.
533 transcription factors/sample x 200k rows
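 Chromosome | start     | end       | length | strand | pwm      | purity   | IsBound
------------+-----------+-----------+--------+--------+----------+----------+---------
 chr10      | 100001379 | 100001390 |    501 |        | 10.95492 | -0.96717 | FALSE
 chr10      | 100010611 | 100010622 |    500 | +      | 11.32268 | -0.86117 | FALSE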

Retrieve on various attributes and region identifier (transcription factor name)


select * from tf_purity_piq_new
where sample_id = 2225 AND tf_name = 'CTCF.known1'
  AND purity >= 0.7 AND purity <= 0.9;
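
The range filter on `purity` suggests that `purity` acts as a clustering column inside a (`sample_id`, `tf_name`) partition, although the deck does not show the table definition. The Python sketch below only emulates that access pattern in memory under this assumed layout. It is not Cassandra driver code, `FootprintStore` is a hypothetical name, and the rows are made up.

```python
import bisect
from collections import defaultdict


class FootprintStore:
    """In-memory stand-in: rows grouped by partition key, kept sorted by purity."""

    def __init__(self):
        # (sample_id, tf_name) -> (sorted purity keys, rows in the same order)
        self._partitions = defaultdict(lambda: ([], []))

    def insert(self, sample_id, tf_name, purity, row):
        keys, rows = self._partitions[(sample_id, tf_name)]
        i = bisect.bisect_right(keys, purity)
        keys.insert(i, purity)
        rows.insert(i, row)

    def purity_range(self, sample_id, tf_name, low, high):
        """Return rows with low <= purity <= high from a single partition."""
        keys, rows = self._partitions.get((sample_id, tf_name), ([], []))
        lo = bisect.bisect_left(keys, low)
        hi = bisect.bisect_right(keys, high)
        return rows[lo:hi]


store = FootprintStore()
# Made-up example rows; the purity values are illustrative only.
for purity, start in [(0.95, 100), (0.72, 200), (0.85, 300), (0.60, 400)]:
    store.insert(2225, "CTCF.known1", purity, {"start": start, "purity": purity})

hits = store.purity_range(2225, "CTCF.known1", 0.7, 0.9)
print([h["start"] for h in hits])
```

Because rows inside a partition are kept in sorted order, the range lookup is a contiguous slice rather than a scan, which is the property the query above appears to rely on.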


Footprint Detection and Storage


Retrieve data from Cassandra and process using Spark
to calculate the signal strength of each TF in the
sample.
Store the signal data in Cassandra to draw online
visualizations.


Peaks Processing and Storage


Each sample will have between 150K and 200K peaks.
A typical biological experiment can have between 10 and 200 samples.
Consolidate and process overlapping peaks.

Processed using Spark GraphX
Source: http://bedtools.readthedocs.io/

A typical experiment will have between 300K and 600K overlapping peaks (depending on dataset and
sequencing depth).
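
The deck states that overlapping peaks are consolidated with Spark GraphX but does not show the code. As a plain-Python stand-in (not the GraphX implementation), the sketch below uses a sort-and-sweep merge similar in spirit to `bedtools merge`. The function name `merge_peaks` and the sample intervals are hypothetical, and treating touching intervals as overlapping is a choice made for this example.

```python
from collections import defaultdict


def merge_peaks(peaks):
    """Merge overlapping (chrom, start, end) intervals per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps (or touches) the open interval: extend it.
                current_end = max(current_end, end)
            else:
                # Gap found: close the open interval and start a new one.
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


# Made-up peaks from two samples of the same experiment.
sample_a = [("chr1", 100, 200), ("chr1", 500, 600)]
sample_b = [("chr1", 150, 250), ("chr2", 10, 20)]
print(merge_peaks(sample_a + sample_b))
```

A graph formulation treats each peak as a vertex and each overlap as an edge, so that connected components correspond to merged regions. The sweep above produces the same grouping for a single machine.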


Differential Peaks Data


Use machine learning to identify regions showing significant differences between two sets of data
(i.e. peaks data). Approximately 100K to 200K peaks.

create table IF NOT EXISTS project_norm_values_diff (
    project_id int,
    peak_window text,
    pvalue double,
    sample_id_value_map map<int, double>,
    PRIMARY KEY (project_id, pvalue, peak_window)
);

select * from project_norm_values_diff where project_id = 333 and pvalue > 0.9 limit 100;


Differential Peaks Analysis


Differential peaks are further grouped with k-means clustering using Spark MLlib.
Clustered data is stored in Cassandra.

CREATE TABLE IF NOT EXISTS difpeak_sample_clusterinfo (
    project_id int,
    kvalue int,
    cluster_location int,
    sample_id int,
    avg_peakvalue double,
    num_peaks_in_cluster int,
    PRIMARY KEY (project_id, kvalue, sample_id, cluster_location)
) WITH CLUSTERING ORDER BY (kvalue ASC, sample_id ASC);
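
The clustering itself is done with Spark MLlib in the deck. To illustrate the idea only, here is a small plain-Python sketch of Lloyd's k-means algorithm on made-up data, where each point is one peak and each coordinate is that peak's value in one sample. The deterministic initialization (first `k` points) is a simplification for reproducibility. Reading the per-cluster size and centroid coordinates as `num_peaks_in_cluster` and `avg_peakvalue` is an assumption about how the table is populated.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; returns (centroids, assignments)."""
    centroids = [list(p) for p in points[:k]]  # deterministic init for the example
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Made-up peak values: rows are peaks, columns are two hypothetical samples.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.8, 1.2)]
centroids, assignments = kmeans(points, k=2)

for c, centroid in enumerate(centroids):
    size = assignments.count(c)  # analogous to num_peaks_in_cluster
    # Each centroid coordinate is the mean peak value of one sample in the cluster.
    print(c, size, [round(v, 3) for v in centroid])
```

In this reading, one row per (`kvalue`, `sample_id`, `cluster_location`) would hold one centroid coordinate together with the cluster size.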


More machine learning and analysis


1. Dimensionality Reduction (Principal Component Analysis)

rows = Projects x Samples x Samples

 project_id | sample_id | pc_name | pc_value
------------+-----------+---------+-----------
        237 |      4430 |       0 | -0.179271
        237 |      4430 |       1 |  0.340772
        237 |      4430 |       2 |  0.308466


Correlation Analysis
Pearson correlation of peaks among all samples.
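
As a minimal sketch of the pairwise computation, assuming each sample is represented by a list of values over the same set of peaks (the layout, the function names, and the numbers below are hypothetical, and the deck does not show how the authors compute it):

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(samples):
    """Pairwise Pearson correlations for a dict of sample_id -> peak values."""
    ids = sorted(samples)
    return {(i, j): pearson(samples[i], samples[j]) for i in ids for j in ids}


# Made-up peak values for three hypothetical samples over the same four peaks.
samples = {1: [1.0, 2.0, 3.0, 4.0], 2: [2.0, 4.0, 6.0, 8.0], 3: [4.0, 3.0, 2.0, 1.0]}
matrix = correlation_matrix(samples)
for pair in sorted(matrix):
    print(pair, round(matrix[pair], 3))
```

The resulting matrix is symmetric with ones on the diagonal, which is the shape typically drawn as a sample-by-sample heatmap.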


Reproducibility Analysis
Shows variance and similarity among replicates
