
The Gamma Operator for Big Data Summarization on an Array DBMS

Carlos Ordonez

Acknowledgments
Michael Stonebraker, MIT
My PhD students: Yiqun Zhang, Wellington Cabrera
SciDB team: Paul Brown, Bryan Lewis, Alex Polyakov

Why SciDB?
Large matrices beyond RAM size
Storage by row or column is not good enough
Matrices are natural in statistics, engineering, and science
Multidimensional arrays -> matrices, but not the same thing
Parallel shared-nothing is best for big data analytics
Closer to DBMS technology, but with some similarity to Hadoop
Feasible to create array operators taking matrices as input and producing a matrix as output
Combine processing with the R package and LAPACK

Old: separate sufficient statistics
New: generalizing and unifying sufficient statistics: Z = [1, X, Y]
Equivalent model equations obtained via projections from Gamma
Properties of Gamma
Further properties: non-commutative and distributive
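Written out (a reconstruction in LaTeX consistent with the published Gamma definition; X is d x n, Y is 1 x n, and z_i = (1, x_i, y_i)):

\Gamma = Z Z^T = \sum_{i=1}^{n} z_i z_i^T =
\begin{bmatrix}
  n          & L^T       & \sum_i y_i   \\
  L          & Q         & X Y^T        \\
  \sum_i y_i & (X Y^T)^T & \sum_i y_i^2
\end{bmatrix},
\qquad
L = \sum_{i=1}^{n} x_i, \quad Q = X X^T

Hence n, L, Q and the y cross-terms are all projections (sub-blocks) of Gamma; and because Gamma is a sum over points, it distributes over any partition of X (partial Gammas merge by addition), while Z Z^T != Z^T Z (non-commutative).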

Storage in array chunks


In SciDB we store the points of X as a 2D array.
[Figure: parallel SCAN of X, one chunk per worker]


Array storage and processing in SciDB
Assuming d << n, it is natural to hash-partition X by i = 1..n
Gamma computation is fully parallel, maintaining local Gamma versions in RAM (see the sketch below)
X can be read with a fully parallel scan
No need to write Gamma from RAM to disk during the scan, unless fault tolerance is required
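A minimal C++ sketch of that pattern (illustrative names, not the actual SciDB operator source): each worker accumulates a local Gamma over its chunk of X in RAM, and the coordinator merges by matrix addition, which is correct because Gamma = sum_i z_i z_i^T.

#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// (d+2) x (d+2) zero matrix for Gamma, with z_i = [1, x_i, y_i].
Matrix zeros(std::size_t m) { return Matrix(m, std::vector<double>(m, 0.0)); }

// Worker side: scan one chunk and accumulate Gamma += z_i * z_i^T in RAM.
void accumulate(Matrix& gamma, const std::vector<double>& x, double y) {
    std::vector<double> z;
    z.reserve(x.size() + 2);
    z.push_back(1.0);                       // intercept entry of z_i
    z.insert(z.end(), x.begin(), x.end());  // the point x_i
    z.push_back(y);                         // the output value y_i
    for (std::size_t a = 0; a < z.size(); ++a)
        for (std::size_t b = 0; b <= a; ++b)  // Gamma is symmetric: lower triangle
            gamma[a][b] += z[a] * z[b];
}

// Coordinator side: local Gammas merge by addition (distributive property).
void merge(Matrix& global_g, const Matrix& local_g) {
    for (std::size_t a = 0; a < global_g.size(); ++a)
        for (std::size_t b = 0; b < global_g[a].size(); ++b)
            global_g[a][b] += local_g[a][b];
}

int main() {
    // Two "workers" with d = 2, so Gamma is 4 x 4.
    Matrix g1 = zeros(4), g2 = zeros(4), g = zeros(4);
    accumulate(g1, std::vector<double>{1.0, 2.0}, 3.0);  // worker 1's point
    accumulate(g2, std::vector<double>{4.0, 5.0}, 6.0);  // worker 2's point
    merge(g, g1);
    merge(g, g2);  // g now equals the global Gamma over both chunks
    return 0;
}

The inner loop touches only the lower triangle since Gamma is symmetric; the dense operator's O(d^2 n) cost on the next slides is exactly this per-point update over all n points.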

Each point must fit in one chunk; otherwise a join is needed (slow).
[Figure: a point split across chunks on Worker 1 forces the Coordinator to join (NO!); a point stored whole in one worker's chunk is OK]

Parallel computation
[Figure: Worker 1 and Worker 2 each send their local Gamma to the Coordinator, which adds them into the global Gamma]


Dense matrix operator: O(d^2 n)


Sparse matrix operator: O(d n) for a hyper-sparse matrix
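A hedged sketch of the corresponding sparse per-point update, assuming x_i arrives as (index, value) pairs of its k nonzero entries; the cost is O(k^2) per point instead of O(d^2), which approaches O(d n) overall for a hyper-sparse X:

#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // (d+2) x (d+2), zero-initialized

// x holds the (column index, value) pairs of the k nonzero entries of x_i.
void accumulate_sparse(Matrix& gamma,
                       const std::vector<std::pair<std::size_t, double>>& x,
                       double y) {
    // Nonzero entries of z_i = [1, x_i, y_i]: the intercept and y_i are always nonzero.
    std::vector<std::pair<std::size_t, double>> z;
    z.reserve(x.size() + 2);
    z.emplace_back(0, 1.0);                                         // intercept entry
    for (const auto& e : x) z.emplace_back(e.first + 1, e.second);  // shift past row 0
    z.emplace_back(gamma.size() - 1, y);                            // y_i in the last slot
    // Only (k+2)^2 updates per point instead of (d+2)^2.
    for (const auto& a : z)
        for (const auto& b : z)
            if (b.first <= a.first)                    // symmetric: lower triangle only
                gamma[a.first][b.first] += a.second * b.second;
}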


Pros: Algorithm evaluation with physical array operators
Since x_i fits in one chunk, joins are avoided (a hash or merge join would cost at least 2X the I/O)
Since x_i x_i^T can be computed in RAM, we avoid an aggregation that would require sorting points by i
No need to store X twice (X and X^T): half the I/O, half the RAM space
No need to transpose X, a costly reorganization even in RAM, especially if X spans several RAM segments
Operator works in C++ compiled code: fast; each vector accessed once; direct assignment (bypassing C++ function calls)


System issues and limitations
Gamma is not efficiently computable in AQL or AFL: hence an operator is required
Arrays of tuples in SciDB are more general, but cumbersome for matrix manipulation: we use arrays of a single attribute (double)
Points must be stored completely inside a chunk: wide rectangular chunks, which may not be I/O optimal
Slow: arrays must be pre-processed into SciDB load format, loaded into a 1D array, and re-dimensioned => optimize the load
Multiple SciDB instances per node improve I/O speed: interleaving CPU
Larger chunks are better (8 MB), especially for dense matrices; avoid shuffling; avoid joins
Dense (alpha) and sparse (beta) versions

Benchmark: scale-up emphasis
Small: cluster with 2 Intel Quad-core servers, 4 GB RAM, 3 TB disk
Large: Amazon cloud 2


Why is Gamma faster than SciDB+LAPACK?

Gamma operator; times in seconds:

d     TOTAL   Scan   Gamma op CPU   merge   mem alloc
100     3.5    2.2       0.7         0.0      0.1
200    10.9    8.6       1.0         0.0      0.1
400    38.8   33.9       2.2         0.1      0.1
800   145.0  134.7       4.6         0.4      0.1
1600  599.8  575.5      11.4         1.0      0.1

SciDB and LAPACK (crossprod() call in SciDB); times in seconds, rows follow the same d as above, * = run failed:

d     TOTAL    subarray 1  repart 1  subarray 2  repart 2  build 0s  transpose   gemm   ScaLAPACK   MKL
100     77.3      0.1         0.3        41.7       0.1       25.9       0.0       8.0      0.8      0.2
200    163.0      0.1         0.2        84.9       0.1       55.7       0.0      17.2      1.8      0.6
400    373.1      0.1         0.3       172.6       0.5      120.6       0.3      39.4      5.4      2.1
800   1497.3      0.1         0.1       553.6       0.8      537.6       0.5     169.8     21.2      8.1
1600      *        *           *           *          *         *         *         *        *      33.4

Combination: SciDB + R


Can the Gamma operator beat LAPACK?
Gamma versus OpenBLAS LAPACK (90% of the performance of MKL)
Gamma: scan, sparse/dense, 2 threads; disk+RAM+CPU
LAPACK: OpenBLAS ~= MKL; 2 threads; RAM+CPU
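For context, the LAPACK side of the comparison reduces to the rank-k update Q = X X^T (what crossprod() computes). A minimal sketch through the CBLAS interface of OpenBLAS/MKL; the row-major layout and the crossprod wrapper are illustrative assumptions:

#include <cstddef>
#include <vector>
#include <cblas.h>  // OpenBLAS or MKL CBLAS interface

// Q = X * X^T for a dense d x n matrix X stored row-major (illustrative wrapper).
std::vector<double> crossprod(const std::vector<double>& X, int d, int n) {
    std::vector<double> Q(static_cast<std::size_t>(d) * d, 0.0);
    // dsyrk rank-k update: Q := 1.0 * X * X^T + 0.0 * Q, upper triangle only.
    cblas_dsyrk(CblasRowMajor, CblasUpper, CblasNoTrans,
                d, n, 1.0, X.data(), n, 0.0, Q.data(), d);
    return Q;
}

Like the Gamma operator, dsyrk exploits symmetry and fills only one triangle of Q; the gap measured in the table below is dominated by disk scan versus purely in-RAM input.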
Times in seconds; dense/sparse = Gamma operator, BLAS = LAPACK on OpenBLAS (dense); fail = run did not complete:

n     density   d=100                 d=200                 d=400                 d=800
                dense sparse  BLAS   dense sparse  BLAS    dense sparse  BLAS    dense  sparse  BLAS
100k    0.1%     3.3    0.1    0.4    11.3    0.1    1.0    38.9    0.2    3.1    145.0    0.6   10.7
100k    1.0%     3.3    0.1    0.4    11.3    0.2    1.0    38.9    0.4    3.1    145.0    1.0   10.7
100k   10.0%     3.3    0.5    0.4    11.3    0.9    1.0    38.9    2.2    3.1    145.0    6.2   10.7
100k  100.0%     3.3    4.5    0.4    11.3   15.4    1.0    38.9   55.9    3.1    145.0  201.0   10.7
1M      0.1%    31.1    0.2    3.8   103.5    0.2   10.0   316.5    0.4  423.2   1475.7    0.9   fail
1M      1.0%    31.1    0.5    3.8   103.5    1.1   10.0   316.5    3.8  423.2   1475.7    4.0   fail

SciDB in the Cloud: massive parallelism


Conclusions
One-pass summarization matrix operator: parallel, scalable
Optimization of outer matrix multiplication as a sum (aggregation) of vector outer products
Dense and sparse matrix versions required
Operator compatible with any parallel shared-nothing system, but better for arrays
Gamma matrix must fit in RAM, but n is unlimited
Summarization matrix can be exploited in many intermediate computations (with appropriate projections) in linear models; see the sketch after the list below
Simplifies many methods to two phases:
1. Summarization
2. Computing model parameters
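A worked instance for linear regression (a sketch via the standard normal equations; every block is a projection of Gamma, so phase 2 requires no further pass over X):

\hat{\beta} =
\begin{bmatrix} n & L^T \\ L & Q \end{bmatrix}^{-1}
\begin{bmatrix} \sum_i y_i \\ X Y^T \end{bmatrix}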

Requires arrays, but can work with SQL or MapReduce



Future work: Theory
Use Gamma in other models like logistic regression, clustering, factor analysis, HMMs
Connection to frequent itemsets
Sampling
Higher-order expected moments, covariates
Unlikely: numeric stability with unnormalized sorted data

Future work: Systems
DONE: sparse matrices: layout, compression
DONE: beat LAPACK on high d
Online model learning (cursor interface needed, incompatible with a DBMS)
Unlimited d (currently d > 8000); join required for high d? Parallel processing of high d is more complicated, chunked
Interface with BLAS and MKL: not worth it?
Faster than a column DBMS for sparse matrices?
