
The Gamma Operator for Big Data Summarization on an Array DBMS

Carlos Ordonez

Acknowledgments
Michael Stonebraker, MIT
My PhD students: Yiqun Zhang, Wellington Cabrera
SciDB team: Paul Brown, Bryan Lewis, Alex Polyakov

Why SciDB?
Large matrices beyond RAM size
Storage by row or column is not good enough
Matrices are natural in statistics, engineering, and science
Multidimensional arrays -> matrices, but not the same thing
Parallel shared-nothing is best for big data analytics
Closer to DBMS technology, but with some similarity to Hadoop
Feasible to create array operators taking matrices as input and producing a matrix as output
Combine processing with the R package and LAPACK

Old: separate sufficient statistics
New: generalizing and unifying sufficient statistics: Z = [1, X, Y]
Equivalent model equations obtained via projections from Gamma
Properties of Gamma
Further properties: non-commutative and distributive
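Written out (a reconstruction in LaTeX consistent with the published Gamma definition; X is d x n, Y is 1 x n, and z_i = (1, x_i, y_i)):

\Gamma = Z Z^T = \sum_{i=1}^{n} z_i z_i^T =
\begin{bmatrix}
  n          & L^T       & \sum_i y_i   \\
  L          & Q         & X Y^T        \\
  \sum_i y_i & (X Y^T)^T & \sum_i y_i^2
\end{bmatrix},
\qquad
L = \sum_{i=1}^{n} x_i, \quad Q = X X^T

Hence n, L, Q and the y cross-terms are all projections (sub-blocks) of Gamma; and because Gamma is a sum over points, it distributes over any partition of X (partial Gammas merge by addition), while Z Z^T != Z^T Z (non-commutative).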

Storage in array chunks


In SciDB we store the points of X as a 2D array.
[Figure: parallel SCAN of X, one chunk per worker]


Array storage and processing in SciDB
Assuming d << n, it is natural to hash-partition X by i = 1..n
Gamma computation is fully parallel, maintaining local Gamma versions in RAM (see the sketch below)
X can be read with a fully parallel scan
No need to write Gamma from RAM to disk during the scan, unless fault tolerance is required
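A minimal C++ sketch of that pattern (illustrative names, not the actual SciDB operator source): each worker accumulates a local Gamma over its chunk of X in RAM, and the coordinator merges by matrix addition, which is correct because Gamma = sum_i z_i z_i^T.

#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// (d+2) x (d+2) zero matrix for Gamma, with z_i = [1, x_i, y_i].
Matrix zeros(std::size_t m) { return Matrix(m, std::vector<double>(m, 0.0)); }

// Worker side: scan one chunk and accumulate Gamma += z_i * z_i^T in RAM.
void accumulate(Matrix& gamma, const std::vector<double>& x, double y) {
    std::vector<double> z;
    z.reserve(x.size() + 2);
    z.push_back(1.0);                       // intercept entry of z_i
    z.insert(z.end(), x.begin(), x.end());  // the point x_i
    z.push_back(y);                         // the output value y_i
    for (std::size_t a = 0; a < z.size(); ++a)
        for (std::size_t b = 0; b <= a; ++b)  // Gamma is symmetric: lower triangle
            gamma[a][b] += z[a] * z[b];
}

// Coordinator side: local Gammas merge by addition (distributive property).
void merge(Matrix& global_g, const Matrix& local_g) {
    for (std::size_t a = 0; a < global_g.size(); ++a)
        for (std::size_t b = 0; b < global_g[a].size(); ++b)
            global_g[a][b] += local_g[a][b];
}

int main() {
    // Two "workers" with d = 2, so Gamma is 4 x 4.
    Matrix g1 = zeros(4), g2 = zeros(4), g = zeros(4);
    accumulate(g1, std::vector<double>{1.0, 2.0}, 3.0);  // worker 1's point
    accumulate(g2, std::vector<double>{4.0, 5.0}, 6.0);  // worker 2's point
    merge(g, g1);
    merge(g, g2);  // g now equals the global Gamma over both chunks
    return 0;
}

The inner loop touches only the lower triangle since Gamma is symmetric; the dense operator's O(d^2 n) cost on the next slides is exactly this per-point update over all n points.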

Each point must fit in one chunk; otherwise a join is needed (slow).
[Figure: a point split across chunks on Worker 1 forces the Coordinator to join (NO!); a point stored whole in one worker's chunk is OK]

Parallel computation
[Figure: Worker 1 and Worker 2 each send their local Gamma to the Coordinator, which adds them into the global Gamma]


Dense matrix operator: O(d^2 n)


Sparse matrix operator: O(d n) for a hyper-sparse matrix
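A hedged sketch of the corresponding sparse per-point update, assuming x_i arrives as (index, value) pairs of its k nonzero entries; the cost is O(k^2) per point instead of O(d^2), which approaches O(d n) overall for a hyper-sparse X:

#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // (d+2) x (d+2), zero-initialized

// x holds the (column index, value) pairs of the k nonzero entries of x_i.
void accumulate_sparse(Matrix& gamma,
                       const std::vector<std::pair<std::size_t, double>>& x,
                       double y) {
    // Nonzero entries of z_i = [1, x_i, y_i]: the intercept and y_i are always nonzero.
    std::vector<std::pair<std::size_t, double>> z;
    z.reserve(x.size() + 2);
    z.emplace_back(0, 1.0);                                         // intercept entry
    for (const auto& e : x) z.emplace_back(e.first + 1, e.second);  // shift past row 0
    z.emplace_back(gamma.size() - 1, y);                            // y_i in the last slot
    // Only (k+2)^2 updates per point instead of (d+2)^2.
    for (const auto& a : z)
        for (const auto& b : z)
            if (b.first <= a.first)                    // symmetric: lower triangle only
                gamma[a.first][b.first] += a.second * b.second;
}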


Pros: Algorithm evaluation with physical array operators
Since x_i fits in one chunk, joins are avoided (a hash or merge join would cost at least 2X the I/O)
Since x_i x_i^T can be computed in RAM, we avoid an aggregation that would require sorting points by i
No need to store X twice (X and X^T): half the I/O, half the RAM space
No need to transpose X, a costly reorganization even in RAM, especially if X spans several RAM segments
Operator works in C++ compiled code: fast; each vector accessed once; direct assignment (bypassing C++ function calls)


System issues and limitations
Gamma is not efficiently computable in AQL or AFL: hence an operator is required
Arrays of tuples in SciDB are more general, but cumbersome for matrix manipulation: we use arrays of a single attribute (double)
Points must be stored completely inside a chunk: wide rectangular chunks, which may not be I/O optimal
Slow: arrays must be pre-processed into SciDB load format, loaded into a 1D array, and re-dimensioned => optimize the load
Multiple SciDB instances per node improve I/O speed: interleaving CPU
Larger chunks are better (8 MB), especially for dense matrices; avoid shuffling; avoid joins
Dense (alpha) and sparse (beta) versions

Benchmark: scale-up emphasis
Small: cluster with 2 Intel Quad-core servers, 4 GB RAM, 3 TB disk
Large: Amazon cloud 2


Why is Gamma faster than SciDB+LAPACK?

Gamma operator; times in seconds:

d     TOTAL   Scan   Gamma op CPU   merge   mem alloc
100     3.5    2.2       0.7         0.0      0.1
200    10.9    8.6       1.0         0.0      0.1
400    38.8   33.9       2.2         0.1      0.1
800   145.0  134.7       4.6         0.4      0.1
1600  599.8  575.5      11.4         1.0      0.1

SciDB and LAPACK (crossprod() call in SciDB); times in seconds, rows follow the same d as above, * = run failed:

d     TOTAL    subarray 1  repart 1  subarray 2  repart 2  build 0s  transpose   gemm   ScaLAPACK   MKL
100     77.3      0.1         0.3        41.7       0.1       25.9       0.0       8.0      0.8      0.2
200    163.0      0.1         0.2        84.9       0.1       55.7       0.0      17.2      1.8      0.6
400    373.1      0.1         0.3       172.6       0.5      120.6       0.3      39.4      5.4      2.1
800   1497.3      0.1         0.1       553.6       0.8      537.6       0.5     169.8     21.2      8.1
1600      *        *           *           *          *         *         *         *        *      33.4

Combination: SciDB + R


Can the Gamma operator beat LAPACK?
Gamma versus OpenBLAS LAPACK (90% of the performance of MKL)
Gamma: scan, sparse/dense, 2 threads; disk+RAM+CPU
LAPACK: OpenBLAS ~= MKL; 2 threads; RAM+CPU
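For context, the LAPACK side of the comparison reduces to the rank-k update Q = X X^T (what crossprod() computes). A minimal sketch through the CBLAS interface of OpenBLAS/MKL; the row-major layout and the crossprod wrapper are illustrative assumptions:

#include <cstddef>
#include <vector>
#include <cblas.h>  // OpenBLAS or MKL CBLAS interface

// Q = X * X^T for a dense d x n matrix X stored row-major (illustrative wrapper).
std::vector<double> crossprod(const std::vector<double>& X, int d, int n) {
    std::vector<double> Q(static_cast<std::size_t>(d) * d, 0.0);
    // dsyrk rank-k update: Q := 1.0 * X * X^T + 0.0 * Q, upper triangle only.
    cblas_dsyrk(CblasRowMajor, CblasUpper, CblasNoTrans,
                d, n, 1.0, X.data(), n, 0.0, Q.data(), d);
    return Q;
}

Like the Gamma operator, dsyrk exploits symmetry and fills only one triangle of Q; the gap measured in the table below is dominated by disk scan versus purely in-RAM input.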
Times in seconds; dense/sparse = Gamma operator, BLAS = LAPACK on OpenBLAS (dense); fail = run did not complete:

n     density   d=100                 d=200                 d=400                 d=800
                dense sparse  BLAS   dense sparse  BLAS    dense sparse  BLAS    dense  sparse  BLAS
100k    0.1%     3.3    0.1    0.4    11.3    0.1    1.0    38.9    0.2    3.1    145.0    0.6   10.7
100k    1.0%     3.3    0.1    0.4    11.3    0.2    1.0    38.9    0.4    3.1    145.0    1.0   10.7
100k   10.0%     3.3    0.5    0.4    11.3    0.9    1.0    38.9    2.2    3.1    145.0    6.2   10.7
100k  100.0%     3.3    4.5    0.4    11.3   15.4    1.0    38.9   55.9    3.1    145.0  201.0   10.7
1M      0.1%    31.1    0.2    3.8   103.5    0.2   10.0   316.5    0.4  423.2   1475.7    0.9   fail
1M      1.0%    31.1    0.5    3.8   103.5    1.1   10.0   316.5    3.8  423.2   1475.7    4.0   fail

SciDB in the Cloud: massive parallelism


Conclusions
One-pass summarization matrix operator: parallel, scalable
Optimization of outer matrix multiplication as a sum (aggregation) of vector outer products
Dense and sparse matrix versions required
Operator compatible with any parallel shared-nothing system, but better for arrays
Gamma matrix must fit in RAM, but n is unlimited
Summarization matrix can be exploited in many intermediate computations (with appropriate projections) in linear models; see the sketch after the list below
Simplifies many methods to two phases:
1. Summarization
2. Computing model parameters
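A worked instance for linear regression (a sketch via the standard normal equations; every block is a projection of Gamma, so phase 2 requires no further pass over X):

\hat{\beta} =
\begin{bmatrix} n & L^T \\ L & Q \end{bmatrix}^{-1}
\begin{bmatrix} \sum_i y_i \\ X Y^T \end{bmatrix}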

Requires arrays, but can work with SQL or MapReduce



Future work: Theory
Use Gamma in other models like logistic regression, clustering, factor analysis, HMMs
Connection to frequent itemsets
Sampling
Higher-order expected moments, covariates
Unlikely: numeric stability with unnormalized sorted data

Future work: Systems
DONE: sparse matrices: layout, compression
DONE: beat LAPACK on high d
Online model learning (cursor interface needed, incompatible with a DBMS)
Unlimited d (currently d > 8000); join required for high d? Parallel processing of high d is more complicated, chunked
Interface with BLAS and MKL: not worth it?
Faster than a column DBMS for sparse matrices?
