You are on page 1of 17

Genomic Database

Performance Improvements
With Document-Based
Database Architecture
Wade L. Schulz, MD/PhD
Donn K. Felker
Brent G. Nelson, MD
Sponsor: Michael Linden, MD/PhD
presentations.wadeschulz.com/aclps2014
Disclosures
Stakeholders in AgileMedicine, which does not provide
any genomics-related software, products, or services.
Whole-genome
sequencing
The obvious laboratory test
to determine the basis of
every disease process.
Shotgun sequencing
described
Capillary
electrophoresis
released
First commercial
sequencer (ABI Prism)
Pyrosequencing
developed
1980 1986
1995
Sequencing Evolution
slide: 3 / 17
ACLPS 2014 Genomic Database Performance
1998
2005
First commercial
pyrosequencer
(454 Life Sciences)
2010
Semiconductor
sequencer released
(Ion Torrent)
Database Evolution
slide: 4 / 17
ACLPS 2014 Genomic Database Performance
Relational database
defined
First NoSQL
databases begin to
emerge
MySQL released
MongoDB released
1970
1995 2007 2009 1984
Sybase released
Database Evolution
slide: 5 / 17
ACLPS 2014 Genomic Database Performance
1
Assess efficiency of
relational and document
databases for storing
genomic annotations
Experimental Goals
slide: 6 / 17
2
Quantify the benefit of in-
memory indexing to query
genomic annotations
3
Determine whether
traditional disk or solid
state drives improve
database performance
ACLPS 2014 Genomic Database Performance
Parse
Write records
or documents
into database
Load Index
Create
indexes
Create data set
from dbSNP
annotation
Query documents
and single/multi-
table records
Query
Experimental Design
slide: 7 / 17
ACLPS 2014 Genomic Database Performance
61,268,661 records
Virtual Hardware
Amazon EC2 Digital Ocean
Operating System Amazon Linux (x64) CentOS (x64)
Processors 4 vCPU/8 ECU 8 vCPU
Memory 15 GB 16 GB
Disk Type
Elastic Block Store
or PIOPS
Solid State
slide: 8 / 17
ACLPS 2014 Genomic Database Performance
Data Models
{
_id: ObjectId(),
has_sig: bool,
rsid: string,
chr: string,
loci:[
{
gene: string,
mrna_acc: string,
class: string
}
]
}
MongoDB
MySQL
What actually happens?
Results
Write Speed
slide: 10 / 17
ACLPS 2014 Genomic Database Performance
slide: 11 / 17
ACLPS 2014 Genomic Database Performance
Index Creation
MongoDB
MySQL
slide: 12 / 17
ACLPS 2014 Genomic Database Performance
Query Efficiency
String: Search for number of SNPs with gene code (COMT: 842 records)
- MongoDB: {"loci.gene":"GRIN2B"}
- MySQL: SELECT count(distinct s.rsid)
FROM locus l, snp s
WHERE l.snp_id = s.id AND l.gene = COMT

Boolean: Search for number of records with clinical significance annotation
- MongoDB: {"has_sig":"true"}
- MySQL: "SELECT count(s.id)
FROM locus l, snp s
WHERE l.snp_id = s.id AND s.has_sig = true"
MongoDB
MySQL
Why? In-place update?
slide: 13 / 17
ACLPS 2014 Genomic Database Performance
A B A B C
A B C
H
D
D

S
S
D

Conclusions
- Drive type can
drastically affect
write speed
- MongoDB has
significantly higher
write speeds,
especially for large
imports
Write speed
- MySQL is more efficient
at creating Boolean
indexes
- Index creation is
otherwise comparable
on traditional disk
- MySQL index creation
rates may suffer on SSD
Indexing
- In-memory indexing of
MongoDB provides a
significant performance
advantage (~150x)
Queries
slide: 14 / 17
ACLPS 2014 Genomic Database Performance
Wade
Schulz
Donn
Felker
Clinical Pathology
Resident,
Yale University
Brent
Nelson
Healthcare Software
Architect,
Mobile and Cloud
Computing
Neuromodulation
Fellow,
University of Minnesota
team
slide: 15 / 17
ACLPS 2014 Genomic Database Performance
Questions
Presentation Resources
ACLPS 2014 Genomic Database Performance
slide: 17 / 17
presentations.wadeschulz.com/aclps2014
github.com/wadeschulz/research_snpdb

You might also like