Practical Medium Data Analytics with Python - PyData NYC 2013

10 Things I Hate About pandas
Wes McKinney
@wesmckinn
Former quant and MIT math dude
Creator of the pandas project for Python
Author of Python for Data Analysis (O'Reilly)
Founder and CEO of DataPad
3 www.datapad.io
> 20k copies since Oct 2012
Bringing many new people to Python and data analysis with code
http://datapad.io
Founded in 2013, located in SF
In private beta, join us!
Hiring for engineering

Why hate on pandas?
pandas rocks!
So, pandas
Easy-to-use, fast in-memory data wrangling and analytics library
Enabled loads of complex data work to be done by mere mortals in Python
Might have kept R from taking over the world (hehe)
pandas, the project
170 distinct contributors
Over 5400 issues and pull requests on GitHub
Upcoming 0.13 release
But.
pandas's broad applicability is also a liability
Only game in town for a lot of things
pandas is being used in some unplanned ways
Some things to love
No more structured dtype drudgery!
Easy IO!
Data alignment!
Hierarchical indexing!
Time series analytics!
More things to love
Table reshaping
Missing data handling
pandas.merge, pandas.concat
Expressive groupby machinery
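A short sketch of how pandas.merge and groupby combine in practice. The two tables and their column names are hypothetical:

```python
import pandas as pd

# Hypothetical tables: contributions and a candidate-to-party lookup.
contribs = pd.DataFrame({
    "cand": ["A", "B", "A", "C"],
    "amt": [100.0, 250.0, 50.0, 75.0],
})
parties = pd.DataFrame({"cand": ["A", "B"], "party": ["X", "Y"]})

# Left join keeps every contribution; candidate "C" has no match,
# so its party is missing.
joined = pd.merge(contribs, parties, on="cand", how="left")

# Group by the joined column and aggregate. groupby drops missing
# keys by default, so the unmatched row is excluded here.
by_party = joined.groupby("party")["amt"].sum()
print(by_party)
```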
Some pandas use cases
General data wrangling
ETL jobs
Business analytics (incl. BI uses)
Time series analysis, statistical modeling
pandas does many things that are tedious, slow, or difficult to do correctly without it
Unfortunately, pandas is not a database
#1 Slightly too far from the metal
DataFrame's internal structure intended to make row-oriented ops fast on numerical data
Python objects can be used as data, indices (a feature, not a bug)
#2 No support (yet) for memory maps
Many analytics ops require only a small portion of the data
Many ways to materialize the full data set in memory by accident
Axis indexes wouldn't necessarily make sense on out-of-core data sets
N.B. HDF5/PyTables support is a partial solution
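For readers unfamiliar with memory maps, here is a small sketch of the idea using NumPy's np.memmap. This is plain NumPy, not a pandas feature, and the file name and sizes are made up:

```python
import os
import tempfile

import numpy as np

# Write one million float64 values to a raw binary file (illustrative data).
path = os.path.join(tempfile.mkdtemp(), "values.bin")
np.arange(1_000_000, dtype=np.float64).tofile(path)

# Map the file instead of reading it; no values are loaded yet.
mapped = np.memmap(path, dtype=np.float64, mode="r", shape=(1_000_000,))

# Only the pages backing this slice need to be read from disk.
window_sum = float(mapped[500_000:500_010].sum())
print(window_sum)
```

An operation that touches ten values does not have to pull all one million into memory, which is the property the slide says pandas lacks.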
#3 No tight database integration
Makes it difficult to be a serious tool in an ETL toolchain on top of some SQL-ish system
Inadequacy of pandas/NumPy data type systems
Jobs with heavy SQL-reading are slow and use tons of memory
TODO: integrate pandas with ODBC C API and write out SQL data directly into NumPy arrays
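A rough illustration of the current path from SQL into a DataFrame, assuming the pd.read_sql_query function available in current pandas releases and an in-memory SQLite table with invented rows:

```python
import sqlite3

import pandas as pd

# In-memory SQLite table with a NULL in an integer column (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, n INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1), ("b", None), ("c", 3)],
)

# Rows are fetched as Python objects before being turned into columns.
df = pd.read_sql_query("SELECT name, n FROM t", conn)
conn.close()

# The NULL comes back as a missing value; a NumPy integer array cannot
# hold it, so the column does not round-trip as a plain integer column.
print(df.dtypes)
```

Both complaints on the slides show up here: every value passes through a Python object on the way in, and the SQL type (nullable INTEGER) has no exact NumPy counterpart.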
#4 Best-effort NA representation
Inconsistent representation of missing data
No Boolean or Integer NA values
NA needs to be a first-class citizen in analytics operations
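The lack of integer and boolean NA values can be seen by introducing a gap into such data. A minimal sketch with NumPy-backed Series and made-up labels:

```python
import pandas as pd

ints = pd.Series([1, 2, 3], index=["a", "b", "c"])
flags = pd.Series([True, False, True], index=["a", "b", "c"])

# Reindexing adds label "d", which has no value in either Series.
ints_gap = ints.reindex(["a", "b", "c", "d"])
flags_gap = flags.reindex(["a", "b", "c", "d"])

# Integers are upcast to float so NaN can stand in for the missing value;
# booleans fall back to generic Python objects.
print(ints.dtype, "->", ints_gap.dtype)
print(flags.dtype, "->", flags_gap.dtype)
```

In both cases the missing value changes the column's type, which is the inconsistency the slide describes.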
#5 RAM management
Difficult to understand footprint of pandas object
Ample data copying throughout library
Would benefit from being able to compress data in-memory or shuttle data temporarily to disk
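One way to inspect a frame's footprint is DataFrame.memory_usage, which exists in current pandas releases (it should not be assumed for the version discussed in these slides). The frame below is illustrative:

```python
import pandas as pd

# Illustrative frame: one numeric column and one string column.
df = pd.DataFrame({
    "amt": [100.0, 250.0, 50.0] * 1000,
    "state": ["NY", "CA", "TX"] * 1000,
})

# Shallow accounting counts only the array buffers; deep accounting
# also inspects the objects a column refers to.
shallow = int(df.memory_usage(deep=False).sum())
deep = int(df.memory_usage(deep=True).sum())
print(shallow, deep)
```

The gap between the two numbers, when there is one, is memory held by Python objects that the array buffers only point to.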
#6 Weak support for categorical data
Makes pandas not quite a fully-fledged R replacement
GroupBy and Joins slower than they could be
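To show what a categorical representation buys, here is a sketch that encodes strings as integer codes with pd.factorize and then aggregates on the codes with NumPy. The data is made up, and this only illustrates the idea, not how pandas implements GroupBy:

```python
import numpy as np
import pandas as pd

# Made-up data: a repeated string key and a numeric value per row.
states = pd.Series(["NY", "CA", "NY", "TX", "CA", "NY"])
amounts = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# Encode each distinct string as a small integer code
# (codes follow order of first appearance).
codes, levels = pd.factorize(states)

# Group sums can then run on integer codes instead of comparing strings.
sums = np.bincount(codes, weights=amounts, minlength=len(levels))
result = dict(zip(list(levels), sums.tolist()))
print(result)
```

Storing the codes once, as R factors do, avoids repeating the encoding step in every group or join operation.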
#7 Complex GroupBy operations get messy
Must write custom functions to pass to .apply(..)
Easy to run up against DRY problems and general Python syntax limitations
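A small example of the pattern being criticized: a per-group statistic that has no built-in aggregation and so needs a hand-written function. The function name, threshold, and data are hypothetical:

```python
import pandas as pd

# Made-up contributions.
df = pd.DataFrame({
    "cand": ["A", "A", "A", "B", "B"],
    "amt": [100.0, 900.0, 1000.0, 50.0, 150.0],
})


def large_share(amts, threshold=500.0):
    """Fraction of a group's total that comes from amounts >= threshold."""
    return amts[amts >= threshold].sum() / amts.sum()


# Anything beyond the built-in aggregations goes through .apply
# with a custom function.
shares = df.groupby("cand")["amt"].apply(large_share)
print(shares)
```

Each new variation (a different threshold, a second column, a different denominator) tends to mean another near-identical function, which is where the DRY problems come from.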
#8 Appending data slow and tedious
DataFrame not intended as a database table
Makes streaming data use a challenge
B+ tree tables interesting?
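A sketch of why appending is slow, and of the usual workaround. The record stream is hypothetical, and pd.concat is used as the append operation:

```python
import pandas as pd

# Hypothetical stream of incoming records.
records = [{"cand": "A", "amt": float(i)} for i in range(5)]

# Row-at-a-time growth: every concat allocates a new frame and copies
# all rows accumulated so far.
grown = pd.DataFrame([records[0]])
for rec in records[1:]:
    grown = pd.concat([grown, pd.DataFrame([rec])], ignore_index=True)

# Buffering in a plain list and building the frame once avoids the copies.
built = pd.DataFrame(records)
print(grown.equals(built))
```

The two frames hold the same data, but the first approach does work proportional to the square of the number of rows.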

#9 Limited type system, column metadata
Currencies, units
Time zones
Geographic data
Composite data types
#10 No true query processing layer
Operation     SQL
Filter        WHERE, HAVING
Group         GROUP BY
Join          JOIN
Aggregate     SUM, MEAN, ...
Limit/TopK    LIMIT
Sorting       ORDER BY
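For comparison, the same operations written as a chain of pandas calls, with the rough SQL counterpart noted per step. Data and column names are made up:

```python
import pandas as pd

# Made-up rows.
df = pd.DataFrame({
    "cand": ["A", "A", "B", "B", "C"],
    "amt": [100.0, -20.0, 300.0, 50.0, 10.0],
})

top = (
    df[df["amt"] > 0]                  # WHERE amt > 0
    .groupby("cand")["amt"].sum()      # GROUP BY cand, SUM(amt)
    .sort_values(ascending=False)      # ORDER BY total DESC
    .head(2)                           # LIMIT 2
)
print(top)
```

Each step runs eagerly and materializes its own intermediate result; nothing looks at the chain as a whole, which is what a query processing layer would do.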
#11 Slow: no multicore / distributed algos
Hampered by use of Python data structures / GIL interactions
Object internals not designed for concurrent use
Oh no what do we do
Stop believing in the one tool to rule them all
Real Artists Ship
- Steve Jobs
Focus on results
I am heavily biased by a focus on business analytics/BI use cases
Need production-ready software to ship in a relatively short time frame
A new project
In internal development at DataPad
Code named badger
pandas-ish syntax: designed for data processing and analytical queries
Badger in a nutshell
Consistent data type system
Compressed columnar binary storage
High perf analytical query processor
Data preparation/cleaning tools
Time series analytics
Immutable array data, little copying
Analytics kernels: written in C with no dependencies
Caching of useful intermediates
Some benchmarks
Data set: 2012 Election data (FEC)
5.3 million records, 7 columns
Tools
pandas
badger
R: data.table
SQL: PostgreSQL, SQLite
Query 1
Total contributions by candidate
SELECT cand_nm,
       sum(contb_receipt_amt) AS total
FROM fec
GROUP BY cand_nm
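A sketch of Query 1 in pandas, checked against the same SQL run in SQLite. The three rows are invented stand-ins for the FEC data, and the column types are assumptions:

```python
import sqlite3

import pandas as pd

# Tiny stand-in for the FEC table; these rows are invented for illustration.
rows = [
    ("Candidate A", "NY", 100.0),
    ("Candidate B", "CA", 250.0),
    ("Candidate A", "CA", 50.0),
]
columns = ["cand_nm", "contbr_st", "contb_receipt_amt"]
fec = pd.DataFrame(rows, columns=columns)

# pandas formulation of Query 1.
pandas_totals = fec.groupby("cand_nm")["contb_receipt_amt"].sum().to_dict()

# The same SQL as on the slide, run against an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fec (cand_nm TEXT, contbr_st TEXT, contb_receipt_amt REAL)"
)
conn.executemany("INSERT INTO fec VALUES (?, ?, ?)", rows)
sql_totals = dict(
    conn.execute(
        "SELECT cand_nm, sum(contb_receipt_amt) AS total "
        "FROM fec GROUP BY cand_nm"
    ).fetchall()
)
conn.close()

print(pandas_totals == sql_totals)
```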
Query 1
badger (in-memory) : 19ms (1x)
badger (from-disk) : 131ms (6.9x)
pandas (in-memory) : 273ms (14.3x)
R data.table 1.8.10: 382ms (20x)
PostgreSQL : 4.7s (247x)
SQLite : 72s (3800x)
Query 2
Total contributions by candidate and state
SELECT cand_nm, contbr_st,
       sum(contb_receipt_amt) AS total
FROM fec
GROUP BY cand_nm, contbr_st
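Query 2 might be written in pandas roughly as follows. The rows are invented; only the column names come from the SQL above:

```python
import pandas as pd

# Invented rows using the column names from the query.
fec = pd.DataFrame({
    "cand_nm": ["Candidate A", "Candidate B", "Candidate A", "Candidate A"],
    "contbr_st": ["NY", "CA", "CA", "NY"],
    "contb_receipt_amt": [100.0, 250.0, 50.0, 25.0],
})

# Group on both keys, sum, and flatten back to a table with a
# "total" column to match the SQL result shape.
totals = (
    fec.groupby(["cand_nm", "contbr_st"])["contb_receipt_amt"]
    .sum()
    .reset_index(name="total")
)
print(totals)
```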
Query 2
Total contributions by candidate and state
R data.table 1.8.10: 500ms (1.8x)
Query 3
Total contributions by candidate and state with 2 filter predicates
SELECT cand_nm,
       sum(contb_receipt_amt) AS total
FROM fec
WHERE contb_receipt_dt BETWEEN '2012-05-01' AND '2012-11-05'
  AND contb_receipt_amt BETWEEN 0 AND 2500
GROUP BY cand_nm
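A pandas sketch of the SQL as written, which filters on date and amount and groups by candidate. The rows are invented and chosen to sit on and around the filter boundaries:

```python
import pandas as pd

# Invented rows; dates and amounts probe the edges of both filters.
fec = pd.DataFrame({
    "cand_nm": ["Candidate A", "Candidate A", "Candidate B", "Candidate B"],
    "contb_receipt_dt": pd.to_datetime(
        ["2012-04-30", "2012-06-15", "2012-11-05", "2012-07-01"]
    ),
    "contb_receipt_amt": [500.0, 200.0, 2500.0, 3000.0],
})

# Series.between is inclusive on both ends by default, like SQL BETWEEN.
in_window = fec["contb_receipt_dt"].between(
    pd.Timestamp("2012-05-01"), pd.Timestamp("2012-11-05")
)
in_range = fec["contb_receipt_amt"].between(0, 2500)

totals = (
    fec[in_window & in_range]
    .groupby("cand_nm")["contb_receipt_amt"]
    .sum()
    .to_dict()
)
print(totals)
```

Each predicate builds a full boolean array and the filtered frame is copied before grouping, so the filters add work instead of reducing it.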
Query 3
Total contributions by candidate and state with 2 filter predicates
Badger, the future
Distributed in-memory analytics
Multicore algorithms
ETL job-building tools
Open source in some form someday
Looking for algorithms hackers to help
Thank you!
badger (in-memory) : 96ms (1x)
