NIST Stonebraker PDF

Big Data Means
at Least Three Different Things.
Michael Stonebraker
The Meaning of Big Data - 3 Vs

Big Volume
With simple (SQL) analytics
With complex (non-SQL) analytics
Big Velocity
Drink from a fire hose
Big Variety
Large number of diverse data sources to integrate
Big Volume - Little Analytics

Well addressed by data warehouse crowd
Who are pretty good at SQL analytics on
Hundreds of nodes
Petabytes of data
In My Opinion.
Column stores will win
Factor of 50 or so faster than row stores
Big Data - Big Analytics

Complex math operations (machine learning, clustering,
trend detection, ...)
the world of the quants
Mostly specified as linear algebra on array data
A dozen or so common inner loops

Matrix multiply
QR decomposition
SVD decomposition
Linear regression
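As a rough illustration of the four inner loops listed above, here is a minimal NumPy sketch. The data matrix `X`, the coefficients, and the seed are made up for the example; nothing here comes from the talk beyond the operation names.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the run is repeatable
X = rng.normal(size=(6, 3))               # hypothetical 6 x 3 data matrix
true_coef = np.array([1.0, -2.0, 0.5])    # hypothetical coefficients
y = X @ true_coef                         # noise-free target for the regression

# Matrix multiply: the 3 x 3 Gram matrix X^T X
gram = X.T @ X

# QR decomposition: X = Q R, Q has orthonormal columns, R is upper triangular
Q, R = np.linalg.qr(X)

# SVD: X = U diag(s) Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Linear regression by least squares: find coef minimizing ||X coef - y||
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
```

Each call works on a whole array at once, which is the sense in which these analytics are "defined on arrays" rather than on tables.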
Big Analytics on Array Data

An Accessible Example
Consider the closing price on all trading days for the
last 10 years for two stocks A and B
What is the covariance between the two timeseries?
(1/N) * sum_i [ (A_i - mean(A)) * (B_i - mean(B)) ]
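The formula above can be written out directly. This is a minimal sketch in plain Python; the function name `covariance` and the five-day price lists are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def covariance(a, b):
    """Population covariance: (1/N) * sum_i (a_i - mean(a)) * (b_i - mean(b))."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("series must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

# Hypothetical closing prices for five trading days
prices_a = [10.0, 11.0, 12.0, 11.0, 13.0]
prices_b = [20.0, 19.0, 21.0, 22.0, 23.0]
cov_ab = covariance(prices_a, prices_b)
```

For two stocks this is a single pass over two short lists, which is why the two-stock case is the easy one.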
Now Make It Interesting

Do this for all pairs of 4000 stocks
The data is the following 4000 x 2000 matrix
Stock matrix: rows S1, S2, ..., S4000 (one per stock); columns t1, t2, ..., t2000 (one per trading day)
Hourly data?
All securities?
Array Answer
Ignoring the (1/N) and subtracting off the
means ...
Stock * Stock^T
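The array answer can be sketched in NumPy: center each row, then one matrix multiply gives every pairwise covariance. The function name `all_pairs_covariance` and the small random matrix are assumptions for illustration; an 8 x 50 array stands in for the 4000 x 2000 one.

```python
import numpy as np

def all_pairs_covariance(stock):
    """stock: (num_stocks, num_days) array.
    Returns the (num_stocks, num_stocks) population covariance matrix."""
    stock = np.asarray(stock, dtype=float)
    n_days = stock.shape[1]
    # Subtract each stock's mean from its own row
    centered = stock - stock.mean(axis=1, keepdims=True)
    # Stock * Stock^T on the centered data, then restore the (1/N) factor
    return (centered @ centered.T) / n_days

rng = np.random.default_rng(42)
stock = rng.normal(loc=100.0, scale=5.0, size=(8, 50))  # hypothetical prices
cov = all_pairs_covariance(stock)
```

Entry `cov[i, j]` is the covariance between stock i and stock j, so the diagonal holds each stock's variance.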
DBMS Requirements
Complex analytics
Covariance is just the start
Defined on arrays
Data management
Leave out outliers
Just on securities with a market cap over
$10B
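A sketch of how the data management steps combine with the analytics: select securities by market cap, drop outliers, then compute covariance. All values are invented, and the outlier rule (more than 3 standard deviations from a stock's own mean) is an assumption, since the talk does not define one.

```python
import numpy as np

rng = np.random.default_rng(7)
prices = rng.normal(100.0, 5.0, size=(6, 40))                 # hypothetical: 6 stocks x 40 days
market_cap = np.array([2e9, 15e9, 50e9, 8e9, 12e9, 120e9])    # hypothetical, in USD

# Data management step 1: only securities with a market cap over $10B
large = market_cap > 10e9
selected = prices[large]

# Data management step 2 (assumed rule): a value is an outlier if it lies
# more than 3 standard deviations from that stock's mean
mean = selected.mean(axis=1, keepdims=True)
std = selected.std(axis=1, keepdims=True)
is_ok = np.abs(selected - mean) <= 3 * std

# Keep only the days on which no selected stock has an outlier
good_days = is_ok.all(axis=0)
clean = selected[:, good_days]

# Analytics step: all-pairs covariance on what is left
centered = clean - clean.mean(axis=1, keepdims=True)
cov = (centered @ centered.T) / clean.shape[1]
```

The filtering and the linear algebra operate on the same array, which is the point of wanting both in one system.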
These Requirements Arise in Many Other Domains
Auto insurance
Sensor in your car (driving behavior and
location)
Reward safe driving (no jackrabbit stops,
stay out of bad neighborhoods)
Ad placement on the web
Cluster customer sessions
Lots of science apps
Genomics, satellite imagery, astronomy,
weather, ...
In My Opinion.
The focus will shift quickly from small math to
big math in many domains
I.e., this stuff will become mainstream.
Solution Options
R, SAS, MATLAB, et al.
Weak or non-existent data management
File system storage
R doesn't scale and is not a parallel system

Revolution does a bit better
Solution Options
RDBMS alone
SQL simulator (MADlib) is slooooow (analytics * .01)

And only does some of the required operations
Coding operations as UDFs still requires you to
simulate arrays on top of tables --- sloooow
And current UDF model not powerful enough to
support iteration
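To make "simulate arrays on top of tables" concrete, here is a minimal sketch using Python's built-in `sqlite3`. Each matrix is stored as (row, column, value) tuples and matrix multiply becomes a join plus a GROUP BY. The table names `a` and `b` and the 2 x 2 values are hypothetical, and this is not how MADlib itself is implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (i INTEGER, j INTEGER, v REAL);
    CREATE TABLE b (i INTEGER, j INTEGER, v REAL);
""")

# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], one row per matrix cell
conn.executemany("INSERT INTO a VALUES (?, ?, ?)",
                 [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)])
conn.executemany("INSERT INTO b VALUES (?, ?, ?)",
                 [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)])

# C[i, j] = sum over k of A[i, k] * B[k, j]
rows = conn.execute("""
    SELECT a.i, b.j, SUM(a.v * b.v)
    FROM a JOIN b ON a.j = b.i
    GROUP BY a.i, b.j
    ORDER BY a.i, b.j
""").fetchall()
conn.close()
```

Every cell carries its two indices as extra columns and the multiply goes through a general-purpose join, which gives some feel for where the overhead comes from.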
Solution Options
R + RDBMS
Have to extract and transform the data from RDBMS table to R data format
"Move the world" nightmare
Need to learn 2 systems

And R still doesn't scale and is not a parallel system
Solution Options
Hadoop
Analytics * .01
Data management * .01
Because
No state
No sticky computation
No point-to-point messaging
Only viable if you don't care about performance
Solution Options
New Array DBMS designed with this market in mind
An Example Array Engine DB

SciDB (SciDB.org)
All-in-one:
data management on arrays
massively scalable advanced analytics
Data is updated via time-travel; not overwritten
Supports reproducibility for research and compliance
Supports uncertain data, provenance

Open source
Hardware agnostic
Big Velocity
Trading volumes going through the roof on
Wall Street breaking infrastructure
Sensor tagging of {cars, people, ...} creates a
firehose to ingest
The web empowers end users to submit
transactions sending volume through the
roof
PDAs let them submit transactions from
anywhere
Two Different Solutions

Big pattern - little state (electronic trading)
Find me a strawberry followed within 100
msec by a banana
Complex event processing (CEP) is focused on this problem
Patterns in a firehose
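A minimal sketch of the "strawberry followed within 100 msec by a banana" pattern, assuming events arrive as (timestamp in ms, symbol) pairs in time order. The function name `find_followed_by` and the sample stream are hypothetical; real CEP engines offer much richer pattern languages.

```python
from collections import deque

def find_followed_by(events, first, second, window_ms):
    """Yield (t_first, t_second) whenever `second` occurs within window_ms
    after a `first`. `events` must be in nondecreasing timestamp order."""
    pending = deque()  # timestamps of `first` events still inside the window
    for ts, symbol in events:
        # Expire `first` events that are now too old to match anything
        while pending and ts - pending[0] > window_ms:
            pending.popleft()
        if symbol == second:
            for t_first in pending:
                yield (t_first, ts)
        if symbol == first:
            pending.append(ts)

stream = [(0, "strawberry"), (40, "apple"), (90, "banana"),
          (300, "strawberry"), (450, "banana"),
          (460, "strawberry"), (500, "banana")]
matches = list(find_followed_by(stream, "strawberry", "banana", 100))
```

The only state kept is the handful of timestamps inside the current window, which is the "little state" half of "big pattern - little state".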
P.S. I started StreamBase but I have no current relationship with the company
Two Different Solutions

Big state - little pattern
For every security, assemble my real-time
global position
And alert me if my exposure is greater
than X
Looks like high performance OLTP
Want to update a database at very high
speed
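A sketch of the "big state - little pattern" case: keep a running position per security and raise an alert when exposure passes a limit. The class name `PositionTracker`, the trades, and the definition of exposure as |net quantity x latest price| are all assumptions made for the example.

```python
from collections import defaultdict

class PositionTracker:
    """Keeps a net position per security and checks it on every trade."""

    def __init__(self, exposure_limit):
        self.exposure_limit = exposure_limit
        self.positions = defaultdict(float)  # security -> net quantity held

    def apply_trade(self, security, quantity, price):
        """Apply a signed trade; return an alert string, or None if within the limit."""
        self.positions[security] += quantity
        # Assumed definition: exposure = |net quantity * latest trade price|
        exposure = abs(self.positions[security] * price)
        if exposure > self.exposure_limit:
            return (f"ALERT {security}: exposure {exposure:.2f} "
                    f"exceeds {self.exposure_limit:.2f}")
        return None

tracker = PositionTracker(exposure_limit=10_000.0)
alerts = [
    tracker.apply_trade("IBM", 50, 100.0),    # exposure 5,000: no alert
    tracker.apply_trade("IBM", 60, 100.0),    # exposure 11,000: alert
    tracker.apply_trade("IBM", -80, 100.0),   # exposure 3,000: no alert
]
```

The pattern is a single comparison; the work is in updating the state for every security on every trade, which is why this looks like high-performance OLTP.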
My Suspicion
You have 3-4 "Big state - little pattern"
problems for every one "Big pattern - little
state" problem
Solution Choices
Old SQL
The elephants
No SQL
75 or so vendors giving up both SQL and ACID
New SQL
Retain SQL and ACID but go fast with a new
architecture
Why Not Use Old SQL?

Sloooow
By a couple orders of magnitude
Because of
Disk
Heavy-weight transactions
Multi-threading
See "OLTP Through the Looking Glass, and What We Found There"
SIGMOD 2008
No SQL
Give up SQL
Interesting to note that
Cassandra and Mongo are
moving to (yup) SQL
Give up ACID
If you need ACID, this is a
decision to tear your hair out
by doing it in user code
Can you guarantee you won't
need ACID tomorrow?
VoltDB: an example of New SQL

A main memory SQL engine
Open source
Shared nothing, Linux, TCP/IP on jelly beans
Light-weight transactions
Run-to-completion with no locking
Single-threaded
Multi-core by splitting main memory
About 100x RDBMS on TPC-C
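The combination of single-threaded execution, run-to-completion transactions, and memory split across cores can be sketched as follows. This is only an illustration of the idea, not VoltDB's implementation: the `Partition` and `Store` classes and the `deposit` transaction are hypothetical, and Python threads will not give true multi-core parallelism.

```python
import queue
import threading

class Partition(threading.Thread):
    """Owns one slice of main memory. Transactions run one at a time,
    start to finish, so the data needs no locks."""

    def __init__(self):
        super().__init__(daemon=True)
        self.data = {}
        self.inbox = queue.Queue()

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:          # shutdown signal
                break
            txn, key, arg, reply = item
            reply.put(txn(self.data, key, arg))  # run to completion

def deposit(data, key, amount):
    """Hypothetical transaction: add to a balance, return the new balance."""
    data[key] = data.get(key, 0) + amount
    return data[key]

class Store:
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]
        for p in self.partitions:
            p.start()

    def execute(self, txn, key, arg):
        # Route by key so each key is only ever touched by one thread
        part = self.partitions[hash(key) % len(self.partitions)]
        reply = queue.Queue(maxsize=1)
        part.inbox.put((txn, key, arg, reply))
        return reply.get()

    def close(self):
        for p in self.partitions:
            p.inbox.put(None)
        for p in self.partitions:
            p.join()

store = Store(num_partitions=4)
balances = [store.execute(deposit, "acct-1", 10) for _ in range(3)]
other = store.execute(deposit, "acct-2", 5)
store.close()
```

Because each key belongs to exactly one partition and that partition runs one transaction at a time, isolation comes from the routing rather than from locking.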
In My Opinion
ACID is good
High level languages are good
Standards (i.e. SQL) are good
Big Variety
Typical enterprise has 5000 operational systems
Only a few get into the data warehouse
What about the rest?
And what about all the rest of your data?
Spreadsheets
Access databases
Web pages
And public data from the web?
The World of Data Integration

[Diagram labels: the rest of your data; enterprise data warehouse; text]
Summary
The rest of your data (public and private)
Is a treasure trove of incredibly valuable
information
Largely untapped
Data Tamer
Goal: integrate the rest of your data
Has to
Be scalable to 1000s of sites
Deal with incomplete, conflicting, and incorrect data
Be incremental
Task is never done
Data Tamer in a Nutshell

Apply machine learning and statistics to perform
automatic:
Discovery of structure
Entity resolution
Transformation
With a human assist if necessary
WYSIWYG tool (Data Wrangler)
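To give a feel for what entity resolution means, here is a toy sketch that groups near-duplicate names by string similarity using Python's standard `difflib`. This is a stand-in technique, not Data Tamer's method (which the slide describes as machine learning and statistics); the function names, the 0.85 threshold, and the sample records are all assumptions.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve_entities(names, threshold=0.85):
    """Greedy grouping: put each name in the first cluster whose first
    member is similar enough, otherwise start a new cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Hypothetical records drawn from different source systems
records = ["Acme Corp", "ACME Corp.", "Globex Inc", "Acme Corp", "Initech"]
clusters = resolve_entities(records)
```

Pairs that score close to the threshold are the natural place for the "human assist" mentioned above.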
Data Tamer
MIT research project
Looking for more integration problems
Wanna partner?
Take away
One size does not fit all
Plan on (say) 6 DBMS architectures
Use the right tool for the job
Elephants are not competitive
At anything
Have a bad "innovator's dilemma" problem
Newest Intel Science and Technology Center
Focus is on big data: the stuff we have been talking
about
Complex analytics on big data
Scalable visualization
Lowering the impedance mismatch between
streaming and DBMSs
New storage architectures for big data
Moving DBMS functionality into silicon
Hub is at M.I.T.
Looking for more partners ...