The Gamma Operator for Big Data Summarization on an Array DBMS
Carlos Ordonez
Acknowledgments
Michael Stonebraker, MIT
My PhD students: Yiqun Zhang, Wellington Cabrera
SciDB team: Paul Brown, Bryan Lewis, Alex Polyakov
Why SciDB?
Large matrices beyond RAM size
Storage by row or by column is not good enough
Matrices are natural in statistics, engineering, and science
Multidimensional arrays map to matrices, but they are not the same thing
Parallel shared-nothing architecture is best for big data analytics
Closer to DBMS technology, with some similarity to Hadoop
Feasible to create array operators that take matrices as input and return a matrix as output
Processing can be combined with the R package and LAPACK
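The last two bullets can be sketched outside SciDB: an array operator scans a matrix chunk by chunk (so the full matrix never needs to fit in RAM) and accumulates a small matrix result. A minimal NumPy sketch of that chunked access pattern, with all names my own; the in-memory array here stands in for an on-disk SciDB array:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, chunk = 10_000, 4, 1_000
X = rng.normal(size=(n, d))        # stand-in for an on-disk array of n x d values

# Stream the matrix chunk by chunk, the way an array DBMS scans chunks,
# accumulating the small d x d product X^T X without holding all of X at once.
G = np.zeros((d, d))
for start in range(0, n, chunk):
    Xc = X[start:start + chunk]    # one chunk of rows
    G += Xc.T @ Xc                 # partial product for this chunk
```

Only the d x d accumulator must stay in memory; n can grow without bound.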
[Diagrams: coordinator and worker nodes in a shared-nothing cluster; data is partitioned across workers rather than routed through the coordinator]
Parallel computation
[Diagram: each worker computes a local partial result over its partition and sends it to the coordinator]
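The parallel computation above can be simulated in a few lines: each worker computes a partial summarization over its own partition with no data exchange, and only the small d x d partial matrices are sent to and summed by the coordinator. A sketch using Python threads as stand-in workers (names and partitioning are my own assumptions):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(2)
n, d, workers = 8_000, 4, 4
X = rng.normal(size=(n, d))
parts = np.array_split(X, workers)   # each worker's local partition (shared-nothing)

def local_gamma(Xp):
    # Runs independently on one worker: no communication during the scan.
    return Xp.T @ Xp

with ThreadPoolExecutor(max_workers=workers) as pool:
    partials = list(pool.map(local_gamma, parts))

# Coordinator: receives one small d x d matrix per worker and sums them.
G = np.sum(partials, axis=0)
```

Communication cost is one d x d matrix per worker, independent of n.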
Benchmark: scale-up emphasis
Small: cluster with 2 Intel Quad-core servers, 4GB RAM, 3TB disk
Large: Amazon cloud, 2 nodes
Combination: SciDB + R
Benchmark: Gamma operator (dense and sparse versions) vs LAPACK, times in seconds.
Each cell: dense Op / sparse Op / LAPACK; "fail" = LAPACK could not complete.

n=100k:
density    d=100           d=200            d=400            d=800
  0.1%   3.3/0.1/0.4     11.3/0.1/1.0     38.9/0.2/3.1     145.0/0.6/10.7
  1.0%   3.3/0.1/0.4     11.3/0.2/1.0     38.9/0.4/3.1     145.0/1.0/10.7
 10.0%   3.3/0.5/0.4     11.3/0.9/1.0     38.9/2.2/3.1     145.0/6.2/10.7
100.0%   3.3/4.5/0.4     11.3/15.4/1.0    38.9/55.9/3.1    145.0/201.0/10.7

n=1M:
density    d=100           d=200            d=400             d=800
  0.1%   31.1/0.2/3.8    103.5/0.2/10.0   316.5/0.4/423.2   1475.7/0.9/fail
  1.0%   31.1/0.5/3.8    103.5/1.1/10.0   316.5/3.8/423.2   1475.7/4.0/fail
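The sparse-operator column grows with density while the dense column does not, which is why both versions are needed. The effect can be reproduced with SciPy sparse matrices as a stand-in for the operator's sparse code path (matrix sizes and names here are my own, chosen to mirror the smallest benchmark configuration):

```python
import numpy as np
from scipy import sparse

# A 10k x 100 matrix at 0.1% density: ~1,000 nonzeros instead of 1,000,000 cells.
n, d, density = 10_000, 100, 0.001
X = sparse.random(n, d, density=density, format="csr", random_state=7)

# Sparse accumulation: work is proportional to the number of nonzeros,
# not to n*d, so low density means low cost regardless of matrix shape.
G = (X.T @ X).toarray()
```

As density approaches 100%, the nonzero count approaches n*d and the sparse path loses its advantage, matching the last row of the n=100k table.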
Conclusions
One-pass summarization matrix operator: parallel and scalable
Optimization of outer matrix multiplication as a sum (aggregation) of vector outer products
Dense and sparse matrix versions are required
Operator compatible with any parallel shared-nothing system, but works best on arrays
The Gamma matrix must fit in RAM, but n is unlimited
The summarization matrix can be exploited in many intermediate computations (with appropriate projections) in linear models
Simplifies many methods to two phases:
1. Summarization
2. Computing model parameters
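The two phases can be made concrete for linear regression: phase 1 computes the Gamma matrix in one pass as a sum of vector outer products over augmented rows z_i = (1, x_i, y_i); phase 2 derives the model parameters from small blocks of Gamma alone, never revisiting the data. A NumPy sketch under my own variable names and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1_000, 3
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 1.5 + 0.01 * rng.normal(size=n)

# Phase 1: one-pass summarization.
# z_i = [1, x_i, y_i]; Gamma = sum_i z_i z_i^T is a small (d+2)x(d+2) matrix.
Gamma = np.zeros((d + 2, d + 2))
for i in range(n):                       # single sequential pass over the data
    z = np.concatenate(([1.0], X[i], [y[i]]))
    Gamma += np.outer(z, z)

# Phase 2: model parameters from Gamma alone (appropriate projections).
# Top-left block is A^T A for A = [1 X]; the last column above the corner is A^T y.
AtA = Gamma[:d + 1, :d + 1]
Aty = Gamma[:d + 1, d + 1]
beta = np.linalg.solve(AtA, Aty)         # [intercept, beta_1, ..., beta_d]
```

The same Gamma blocks (n, sums, cross-products) also feed PCA and other linear models, which is what makes the one summarization pass reusable.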