
Building PetaByte Warehouses with Unmodified PostgreSQL

Emmanuel Cecchet, Member of Research Staff, May 21st, 2009

Topics
Introduction to frontline data warehouses
PetaByte warehouse design with PostgreSQL
Aster contributions to PostgreSQL
Q&A

PGCon 2009, Ottawa. © 2009 Aster Data Systems

Enterprise Data Warehouse Under Stress

Enterprise Data Warehouse


Offloading The Enterprise Data Warehouse

Frontline Data Warehouse: Petabytes of source data, 24x7 availability, TB/day load capacity, in-database transforms, rapid data access

Enterprise Data Warehouse

Archival Data Warehouse: Petabytes of detailed data, low cost/TB, flexible compression, on-line access, aging out to off-line


Requirements for Frontline Data Warehouses


    !"#$"% &! '(#'' "!&!')0" 1"! &%3"( &! )% 40 3!"&!" 2 2 56 7#'#)" ) ))"% %&8)#'" 9#)#'#$" ))"% %&8)#'" @6A BC D 6C EF )$" % #)G 0" H "(!0 P") )$" I
5
PGCon 2009, Ottawa 2009 Aster Data Systems

Who is Aster Data Systems?


Aster nCluster is a software-only RDBMS for large-scale frontline data warehousing
High performance: Always Parallel MPP architecture
High availability: Always On on-line operations
High value analytics: In-Database MapReduce
Low cost: Petabytes on commodity HW


MySpace Frontline Data Warehouse


  !" 

    

#$$$%

 & &"!H"!

     

EFG 77)1 H I IB BP P7Q 7Q I 9 I 3C) D 5Q1 3C) 9D 5Q1 '()01 234530 67(8) 9 '()01 @ABC D 01

R STUUTVW XY`Wab c defgS h`i pqr stuvwuxtxy t

" ! d " !e f ! ghhi jklmno pqhh rs tuvutwrx

PGCon 2009, Ottawa 2009 Aster Data Systems

Topics
Introduction to frontline data warehouses
PetaByte warehouse design with PostgreSQL
Aster contributions to PostgreSQL
Q&A


Petabyte Data Warehouse Design: PostgreSQL as a building block


Do not hack PostgreSQL to serve the distributed database
Build on top of mainline PostgreSQL
Use standard Postgres APIs

Service Oriented Architecture


Hierarchical query planning and optimization
Shared-nothing replication treating Postgres and Linux as a service
Compression at the OS level, transparently to PostgreSQL
In-database MapReduce, out-of-process


Aster nCluster Database


Queen Node
Queries/Answers

Queen Server Group


Worker Node
Queries

Worker Server Group

Data

Loader Node

Loader/Exporter Server Group


Query Processing: How It Works



A: Users send queries to the Queen (via ACT, ODBC/JDBC, BI tools, etc.)
B: The Queen parses the query, creates an optimal distributed plan, sends subqueries to the Workers, and supervises the processing
C: Workers execute their subqueries (locally or via distributed communication)
D: The Queen aggregates Worker results and returns the final result to the user
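To make steps B and D concrete, here is a minimal sketch of how a global aggregate could be decomposed into per-Worker subqueries plus a final merge. The clicks table, its columns, and the partition names are hypothetical illustrations; this is not the actual plan format nCluster produces.

-- Query submitted to the Queen (hypothetical clickstream table):
SELECT user_id, count(*) AS clicks
FROM clicks
GROUP BY user_id;

-- Subquery each Worker runs against its local PostgreSQL partition:
SELECT user_id, count(*) AS partial_clicks
FROM clicks_local_partition
GROUP BY user_id;

-- Final merge of the collected partial results on the Queen:
SELECT user_id, sum(partial_clicks) AS clicks
FROM gathered_partial_results
GROUP BY user_id;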


Table Compression Architecture


Query

[Diagram: compressed tables within the logical database (e.g. a schema) are stored across physical tablespaces holding compressed blocks]

How It Works
Data is compressed / decompressed below the logical database
Stored in different tablespaces depending on compression level

Architecture Benefits
High compression ratios (bigger block sizes yield more cost savings)
Database transparent (ease of future Postgres upgrades)
Concurrency (high-performance multi-table compression)
Performance: "few-row" queries don't need full table decompression
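A minimal sketch of what "stored in different tablespaces depending on compression level" could look like from PostgreSQL's point of view, assuming each tablespace directory sits on a storage volume compressed at the OS level. The paths, tablespace names, and the clicks_2007 table are hypothetical; the compression mechanism itself lives outside PostgreSQL.

CREATE TABLESPACE compress_high   LOCATION '/data/ts_high';    -- heavily compressed volume
CREATE TABLESPACE compress_medium LOCATION '/data/ts_medium';  -- moderately compressed volume
CREATE TABLESPACE uncompressed    LOCATION '/data/ts_none';    -- plain volume

-- PostgreSQL only sees ordinary tablespaces; queries are unchanged:
CREATE TABLE clicks_2007 (user_id bigint, ts timestamptz, url text) TABLESPACE compress_high;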


Table Compression Enables Powerful Archival

Compression tiers, oldest to newest: Compressed - High, Compressed - Medium, Compressed - Low, Not Compressed

Older data accessed less frequently
Compress to save space and cost
Oldest data is compressed the most, recent is compressed the least
Compressed tables are fully available for queries (true online archival)
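Reusing the hypothetical tablespaces from the previous slide, the aging policy above could be expressed with standard PostgreSQL commands. This is a sketch under those assumptions, not Aster's actual archival mechanism.

ALTER TABLE clicks_2007 SET TABLESPACE compress_high;    -- oldest data, most compression
ALTER TABLE clicks_2008 SET TABLESPACE compress_medium;  -- older data, medium compression
-- current data (e.g. a clicks_2009 table) stays on the uncompressed tablespace

SELECT count(*) FROM clicks_2007;   -- compressed tables stay available for queries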


What is MapReduce and Why Should I Care?


What is MapReduce?
Popularized by Google
http://labs.google.com/papers/mapreduce.html

Processes data in parallel across distributed cluster

Why is MapReduce significant?


Empowers ordinary developers
Write application logic, not debug cluster communication code

Why is In-Database MapReduce significant?


Unites MapReduce with SQL: power invoked from SQL
Develop SQL/MR functions with common languages

Aster In-Database MapReduce


Users, analysts, applications

SQL/MR Functions

In-Database MapReduce

Data Store Engine Aster nCluster



In-Database MapReduce
Extensible framework (MapReduce + SQL)
Flexible: Map-Reduce expressiveness, languages, polymorphism
Performance: Massively parallel, computational push-down
Availability: Fault isolation, resource management

Out-of-process executables
Does not use PL/* for custom code execution
Can execute Map and Reduce functions in any language that has a runtime on Linux (e.g. Java, Python, C#, C/C++, Ruby, Perl, R, etc)
Standard PostgreSQL APIs to send/receive data to executables
Fault isolation, security and resource management for arbitrary user code
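As an illustration of "power invoked from SQL", a SQL/MR call follows the SELECT ... FROM function(ON ...) shape. The sessionize function, its TIMEOUT argument, and the clickstream table below are hypothetical examples, not taken from the slides.

SELECT user_id, session_id, count(*) AS clicks_in_session
FROM sessionize(
       ON clickstream
       PARTITION BY user_id
       ORDER BY ts
       TIMEOUT('60')   -- hypothetical argument: session gap in seconds
     )
GROUP BY user_id, session_id;

The function body itself would be an out-of-process executable (Java, Python, C#, ...) that each Worker streams rows to and from, as described above.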


Always On: Minimize Planned Downtime

Rebalance Data
Live Queries

Add Capacity

Data Backup

Load & Export

Backup & Restore


Precision Scaling


When more CPU/memory/capacity is needed, new nodes can be added for scale-out. Precision Scaling uses standard PostgreSQL APIs to migrate vWorker partitions to new nodes, either for load balancing (more compute power) or for capacity expansion.
Example: Assume Workers 1/2/3 are 100% CPU-bottlenecked. Incorporation adds a new Worker 4 node and migrates over vWorker partitions D/H/L. With the load spread over four nodes instead of three, CPU utilization drops to 75% per node, eliminating hotspots.


Replication
[Diagram: shared-nothing replication of vWorker partitions, treating Postgres and Linux as a service]


Fault Tolerance & Automatic Online Failover


[Diagram: Queen with Workers 1-5; each vWorker partition is replicated on a different Worker node, so queries keep running when a Worker fails]

Replication Failover
Automatic, non-disruptive, graceful performance impact

Replication Restoration
Delta-based (fast) and online (non-disruptive)


Using Commodity Hardware



Scaling On-Demand to a PetaByte

Commodity Hardware: 2 TB Building Block
Dell, HP, IBM, Intel x86
16 GB Memory, 2.4 TB of Storage (8 Disks)
$5k to $10k per Node

More Blocks = More Power

Massive Power Per Rack: 160 Cores, 640 GB RAM, 48 TB SAS


Heterogeneous Hardware Support


 

Heterogeneous HW support enables customers to add cheaper/faster nodes to existing systems for investment protection
Mix-and-match different servers as you grow (faster CPUs, more memory, bigger disk drives, different vendors, etc)


Topics
Introduction to frontline data warehouses
PetaByte warehouse design with PostgreSQL
Aster contributions to PostgreSQL
Q&A


Error Logging in COPY

Separate good from malformed tuples
Per-tuple error information
Errors to capture:
Type mismatch (e.g. text vs. int)
Check constraints (e.g. int < 0)
Malformed chars (e.g. invalid UTF-8 seq.)
Missing / extra column
Low performance overhead
Activated on-demand using environment variables
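A possible usage sketch, assuming error logging is toggled per session in the style of the tuple_routing settings shown later in this deck; the variable and table names below are hypothetical, not the exact ones in the patch.

set error_logging = true;
set error_logging_table_name = 'copy_errors';

COPY clicks FROM 'clicks.txt';   -- malformed tuples are diverted to copy_errors, good tuples load

SELECT * FROM copy_errors;       -- per-tuple error context plus the rejected line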


Error Logging in COPY

Detailed error context is logged along with tuple content


Error Logging Performance


1 million tuples COPY performance


Auto-partitioning in COPY
COPY into a parent table routes tuples directly into the child table with matching constraints

COPY y2008 FROM 'data.txt';
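A minimal sketch of the setup this example assumes: y2008 is a parent table with one child per month, each carrying a CHECK constraint that tuple routing can match. The column names and ranges are illustrative.

CREATE TABLE y2008 (ts date, amount numeric);
CREATE TABLE y2008m01 (CHECK (ts >= '2008-01-01' AND ts < '2008-02-01')) INHERITS (y2008);
CREATE TABLE y2008m02 (CHECK (ts >= '2008-02-01' AND ts < '2008-03-01')) INHERITS (y2008);
-- ... one child table per remaining month ...
-- With tuple routing enabled, the COPY above inserts each row directly into the
-- child whose CHECK constraint it satisfies, instead of into the parent.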


Auto-partitioning in loading
Activated on-demand
set tuple_routing_in_copy = 0/1;

Configurable LRU cache size


set tuple_routing_cache_size = 3;

COPY performance into the parent table is similar to a direct COPY into the child table if data is sorted
Will leverage partitioning information in the future (WIP for 8.5)


Other contributions
Temporary tables and 2PC transactions
[Auto-]Partitioning infrastructure
Regression test suite
LFI (http://lfi.sourceforge.net/)
Fault injection at the library level or below: out-of-memory conditions, network connection errors, interrupted system calls, data corruption, hardware failures, etc.
Lightning talk on Friday!


Topics
Introduction to Aster and data warehousing
PetaByte warehouse design with PostgreSQL
Aster contributions to PostgreSQL
Q&A


PetaByte Warehouses with PostgreSQL


Unmodified PostgreSQL
Always Parallel MPP architecture
Always On on-line operations
In-Database MapReduce
PetaByte on commodity Hardware


Aster Data Systems


Learn more
www.asterdata.com
Free TDWI report on advanced analytics: asterdata.com/mapreduce
Free Gartner webcast on mission-critical DW: asterdata.com/gartner

Contact us
hello@asterdata.com


Bonus slides


nCluster Components


Aster Loader

Load Balancing
CHALLENGE: Large data sets create a data transfer performance bottleneck.
SOLUTION: Multiple Loaders are mapped to vWorkers on Worker nodes. After partitioning, Loaders load unique data into vWorkers in parallel.
BENEFIT: Linearly scalable load performance from Loaders to Workers.

Scalable Partitioning
CHALLENGE: Partitioning must be scalable and not impede the loading process.
SOLUTION: Each Loader contains a Partitioner which uses an algorithm to assign data into buckets. Each bucket is uniquely mapped within the cluster.
BENEFIT: Fast, intelligent partitioning during massive-scale data loads.

Fault Tolerance
CHALLENGE: Failed nodes may lose data or significantly drop load performance.
SOLUTION: If a node fails, other streams continue to load with minimum performance hit. Admins can easily recover lost data by reloading bulk feeder data.
BENEFIT: Data loss protection and performance consistency during node failure.

