@wesmckinn
Former quant and MIT math dude
Creator of Pandas project for Python
Author of Python for Data Analysis (O'Reilly)
3 www.datapad.io
> 20k copies since Oct 2012
Bringing many new people
to Python and data analysis
with code
http://datapad.io
Founded in 2013, located in SF
pandas, the project
But.
Some things to love
Table reshaping
Missing data handling
pandas.merge, pandas.concat
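A minimal sketch of the features listed above, on made-up data. The tables and column names (`store`, `quarter`, `revenue`, `city`) are hypothetical and only serve to show `pandas.concat`, missing-data handling, `pandas.merge`, and a reshape in one pass.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables; names and values are made up for illustration.
sales_q1 = pd.DataFrame({"store": ["A", "B"], "quarter": ["Q1", "Q1"], "revenue": [100.0, np.nan]})
sales_q2 = pd.DataFrame({"store": ["A", "B"], "quarter": ["Q2", "Q2"], "revenue": [120.0, 90.0]})
stores = pd.DataFrame({"store": ["A", "B"], "city": ["SF", "NYC"]})

# pandas.concat: stack the two quarterly tables row-wise.
sales = pd.concat([sales_q1, sales_q2], ignore_index=True)

# Missing data handling: reductions skip NaN by default, or it can be filled explicitly.
total_revenue = sales["revenue"].sum()
sales_filled = sales.fillna({"revenue": 0.0})

# pandas.merge: SQL-style left join against the store lookup table.
joined = pd.merge(sales_filled, stores, on="store", how="left")

# Table reshaping: one row per store, one column per quarter.
wide = joined.pivot(index="store", columns="quarter", values="revenue")
```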
Some pandas use cases
General data wrangling
ETL jobs
Business analytics (incl. BI uses)
pandas does many things
that are tedious, slow, or
difficult to do correctly
without it
Unfortunately, pandas is
not a database
#1 Slightly too far from
the metal
DataFrame's internal structure
intended to make row-oriented ops
fast on numerical data
#2 No support (yet) for
memory maps
Many analytics ops require a small portion
of the data
N.B. HDF5/PyTables support is a
partial solution
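To illustrate what a memory map buys when only a small portion of the data is needed, here is a sketch using NumPy's `numpy.memmap`. This is plain NumPy, not a pandas feature; the file name and array size are assumptions made up for the example.

```python
import os
import tempfile

import numpy as np

# Write a hypothetical column of 1,000,000 float64 values to a raw binary file.
path = os.path.join(tempfile.mkdtemp(), "amounts.f8")
np.arange(1_000_000, dtype="float64").tofile(path)

# Map the file instead of loading it; pages are read only when they are touched.
amounts = np.memmap(path, dtype="float64", mode="r", shape=(1_000_000,))

# Only this 1,000-element window has to be pulled in from disk.
window_total = float(amounts[500_000:501_000].sum())
```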
#3 No tight database
integration
Makes it difficult to be a serious tool
in an ETL toolchain on top of some
SQL-ish system
Jobs with heavy SQL-reading are
slow and use tons of memory
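One common workaround for the memory cost, sketched here under assumptions: read the SQL result in chunks with the `chunksize` argument of `pandas.read_sql_query` and keep only running aggregates. The in-memory SQLite table and its rows are hypothetical, and this uses the current pandas API rather than whatever was available when these slides were written. Whether the database driver itself streams rows depends on the driver; the sketch only avoids building one large DataFrame.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table standing in for a real SQL source.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fec (cand_nm TEXT, contb_receipt_amt REAL)")
con.executemany(
    "INSERT INTO fec VALUES (?, ?)",
    [("Obama", 50.0), ("Romney", 75.0), ("Obama", 25.0), ("Romney", 100.0), ("Obama", 10.0)],
)

# Read 2 rows at a time and keep only running totals per candidate,
# so the full result set is never held as a single DataFrame.
totals = {}
query = "SELECT cand_nm, contb_receipt_amt FROM fec"
for chunk in pd.read_sql_query(query, con, chunksize=2):
    partial = chunk.groupby("cand_nm")["contb_receipt_amt"].sum()
    for name, amount in partial.items():
        totals[name] = totals.get(name, 0.0) + float(amount)
con.close()
```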
#4 Best-efforts NA
representation
Inconsistent representation of
missing data
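A small sketch of the inconsistency, assuming default dtype inference in the `pd.Series` constructor: the same missing value changes the column type differently depending on what kind of data it sits in.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
ints_with_gap = pd.Series([1, 2, None])           # integers plus one missing value
bools_with_gap = pd.Series([True, False, None])   # booleans plus one missing value

# Integers are silently upcast to float so the gap can be stored as NaN;
# booleans fall back to a generic object column that keeps None as-is.
int_gap_dtype = str(ints_with_gap.dtype)
bool_gap_dtype = str(bools_with_gap.dtype)
n_missing_ints = int(ints_with_gap.isna().sum())
n_missing_bools = int(bools_with_gap.isna().sum())
```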
#5 RAM management
#6 Weak support for
categorical data
Makes pandas not quite a fully fledged
R replacement
#7 Complex GroupBy
operations get messy
Must write custom functions to pass
to .apply(..)
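A sketch of the pattern this slide describes: a per-group computation with no built-in aggregation, so a custom function is passed to `.apply`. The data and the helper `top2_share` are hypothetical; the column names follow the FEC data set used in the benchmarks.

```python
import pandas as pd

# Hypothetical contributions; values are made up for illustration.
fec = pd.DataFrame({
    "cand_nm": ["Obama", "Obama", "Obama", "Romney", "Romney"],
    "contb_receipt_amt": [50.0, 25.0, 10.0, 75.0, 100.0],
})


def top2_share(amounts):
    """Share of a group's total that comes from its two largest contributions."""
    return amounts.nlargest(2).sum() / amounts.sum()


# The custom function runs once per candidate, in Python.
share = fec.groupby("cand_nm")["contb_receipt_amt"].apply(top2_share)
```

Because the function is called once per group at Python level, this style tends to get slow as the number of groups grows.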
#8 Appending data slow
and tedious
DataFrame not intended as a
database table
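A sketch of why appending is costly and the usual way around it, on hypothetical batches: growing a DataFrame step by step re-copies the existing rows each time, while collecting the pieces and calling `pandas.concat` once copies each row a single time.

```python
import pandas as pd

# Hypothetical batches arriving one at a time.
batches = [pd.DataFrame({"id": [i], "value": [i * 1.5]}) for i in range(5)]

# Growing the table batch by batch copies all existing rows on every step.
grown = batches[0]
for batch in batches[1:]:
    grown = pd.concat([grown, batch], ignore_index=True)

# Collecting the pieces and concatenating once gives the same table.
combined = pd.concat(batches, ignore_index=True)
```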
#10 No true query
processing layer
Filter: WHERE, HAVING
Group: GROUP BY
Join: JOIN
Aggregate: SUM, MEAN, ...
Limit/TopK: LIMIT
Sorting: ORDER BY
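For reference, a sketch of how each of the operations above is usually spelled in pandas, on hypothetical data. Every step runs eagerly and materializes an intermediate table, which is the point of this slide: nothing plans or reorders the chain as a whole.

```python
import pandas as pd

# Hypothetical data; column names follow the FEC data set, values are made up.
fec = pd.DataFrame({
    "cand_nm": ["Obama", "Romney", "Obama", "Paul", "Romney", "Paul"],
    "contbr_st": ["CA", "TX", "NY", "TX", "CA", "CA"],
    "contb_receipt_amt": [50.0, 75.0, 25.0, 10.0, 100.0, 5.0],
})
states = pd.DataFrame({"contbr_st": ["CA", "NY", "TX"], "region": ["West", "East", "South"]})

result = (
    fec[fec["contb_receipt_amt"] >= 10.0]       # Filter: WHERE
    .merge(states, on="contbr_st")              # Join: JOIN
    .groupby("cand_nm", as_index=False)         # Group: GROUP BY
    .agg(total=("contb_receipt_amt", "sum"))    # Aggregate: SUM
    .query("total > 20")                        # Filter: HAVING
    .sort_values("total", ascending=False)      # Sorting: ORDER BY
    .head(1)                                    # Limit/TopK: LIMIT
)
```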
#11 Slow: no multicore /
distributed algos
Hampered by use of Python data
structures / GIL interactions
Oh no, what do we do?
Stop believing in the one
tool to rule them all
Real Artists Ship
- Steve Jobs
Focus on results
A new project
Badger in a nutshell
Consistent data type system
Time series analytics
Some benchmarks
Data set: 2012 Election data (FEC)
5.3 million records, 7 columns
Tools
pandas
badger
R: data.table
SQL: PostgreSQL, SQLite
Query 1
Total contributions by candidate
SELECT cand_nm, sum(contb_receipt_amt) AS total
FROM fec
GROUP BY cand_nm
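A sketch of Query 1 in pandas, cross-checked against the same SQL run through SQLite. The five rows are made up; only the table name `fec` and the column names come from the query above. This says nothing about the timings, which depend on the real data set.

```python
import sqlite3

import pandas as pd

# Hypothetical rows standing in for the FEC table.
fec = pd.DataFrame({
    "cand_nm": ["Obama", "Romney", "Obama", "Romney", "Paul"],
    "contb_receipt_amt": [50.0, 75.0, 25.0, 100.0, 10.0],
})

# pandas version of Query 1.
total_pandas = fec.groupby("cand_nm")["contb_receipt_amt"].sum()

# The same query run through SQLite as a cross-check.
con = sqlite3.connect(":memory:")
fec.to_sql("fec", con, index=False)
rows = con.execute(
    "SELECT cand_nm, sum(contb_receipt_amt) AS total FROM fec GROUP BY cand_nm"
).fetchall()
con.close()
total_sql = dict(rows)
```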
Query 1
Total contributions by candidate
badger (in-memory): 19ms (1x)
badger (from-disk): 131ms (6.9x)
pandas (in-memory): 273ms (14.3x)
R data.table 1.8.10: 382ms (20x)
PostgreSQL: 4.7s (247x)
SQLite: 72s (3800x)
Query 2
Total contributions by candidate
and state
SELECT cand_nm, contbr_st, sum(contb_receipt_amt) AS total
FROM fec
GROUP BY cand_nm, contbr_st
Query 2
Total contributions by candidate and
state
badger (in-memory): 269ms (1x)
badger (from-disk): 391ms (1.5x)
R data.table 1.8.10: 500ms (1.8x)
pandas (in-memory): 770ms (2.9x)
PostgreSQL: 5.96s (23x)
Query 3
Total contributions by candidate
and state with 2 filter predicates
SELECT cand_nm, sum(contb_receipt_amt) AS total
FROM fec
WHERE contb_receipt_dt BETWEEN '2012-05-01' AND '2012-11-05'
  AND contb_receipt_amt BETWEEN 0 AND 2500
GROUP BY cand_nm
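A sketch of the two filter predicates in pandas, using `Series.between`, which is inclusive on both ends by default like SQL `BETWEEN`. The rows are hypothetical and chosen to sit on and around the boundaries.

```python
import pandas as pd

# Hypothetical rows; dates and amounts are picked to exercise the boundaries.
fec = pd.DataFrame({
    "cand_nm": ["Obama", "Obama", "Romney", "Romney", "Romney"],
    "contb_receipt_dt": pd.to_datetime(
        ["2012-04-30", "2012-06-15", "2012-05-01", "2012-11-06", "2012-10-01"]
    ),
    "contb_receipt_amt": [100.0, 250.0, 2500.0, 50.0, 5000.0],
})

# Each predicate yields a boolean mask over the whole column.
in_window = fec["contb_receipt_dt"].between(
    pd.Timestamp("2012-05-01"), pd.Timestamp("2012-11-05")
)
in_range = fec["contb_receipt_amt"].between(0, 2500)

# Combine the masks, then group and sum as in the SQL.
total = fec[in_window & in_range].groupby("cand_nm")["contb_receipt_amt"].sum()
```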
Badger, the future
Distributed in-memory analytics
Multicore algorithms
ETL job-building tools
Open source in some form someday
Looking for algorithms hackers to help
Thank you!