
Building PetaByte Warehouses with Unmodified PostgreSQL

Emmanuel Cecchet, Member of Research Staff, May 21st, 2009

Topics
Introduction to frontline data warehouses
PetaByte warehouse design with PostgreSQL
Aster contributions to PostgreSQL
Q&A

PGCon 2009, Ottawa. © 2009 Aster Data Systems

Enterprise Data Warehouse Under Stress

Enterprise Data Warehouse


Offloading The Enterprise Data Warehouse

Frontline Data Warehouse: Petabytes of source data, 24x7 availability, TB/day load capacity, in-database transforms, rapid data access

Enterprise Data Warehouse

Archival Data Warehouse: Petabytes of detailed data, low cost/TB, flexible compression, on-line access, aging out to off-line


Requirements for Frontline Data Warehouses


    !"#$"% &! '(#'' "!&!')0" 1"! &%3"( &! )% 40 3!"&!" 2 2 56 7#'#)" ) ))"% %&8)#'" 9#)#'#$" ))"% %&8)#'" @6A BC D 6C EF )$" % #)G 0" H "(!0 P") )$" I
5
PGCon 2009, Ottawa 2009 Aster Data Systems

Who is Aster Data Systems?


Aster nCluster is a software-only RDBMS for large-scale frontline data warehousing
High performance: Always Parallel MPP architecture
High availability: Always On on-line operations
High value analytics: In-Database MapReduce
Low cost: Petabytes on commodity HW


MySpace Frontline Data Warehouse


  !" 

    

#$$$%

 & &"!H"!

     

EFG 77)1 H I IB BP P7Q 7Q I 9 I 3C) D 5Q1 3C) 9D 5Q1 '()01 234530 67(8) 9 '()01 @ABC D 01

R STUUTVW XY`Wab c defgS h`i pqr stuvwuxtxy t

" ! d " !e f ! ghhi jklmno pqhh rs tuvutwrx

PGCon 2009, Ottawa 2009 Aster Data Systems

Topics
Introduction to frontline data warehouses
PetaByte warehouse design with PostgreSQL
Aster contributions to PostgreSQL
Q&A


Petabyte Data Warehouse Design: PostgreSQL as a building block


Do not hack PostgreSQL to serve the distributed database
Build on top of mainline PostgreSQL
Use standard Postgres APIs

Service Oriented Architecture


Hierarchical query planning and optimization
Shared-nothing replication treating Postgres and Linux as a service
Compression at the OS level, transparently to PostgreSQL
In-database MapReduce, out-of-process


Aster nCluster Database


Queen Node
Queries/Answers

Queen Server Group


Worker Node
Queries

Worker Server Group

Data

Loader Node

Loader/Exporter Server Group


Query Processing: How It Works



A: Users send queries to the Queen (via ACT, ODBC/JDBC, BI tools, etc.)
B: The Queen parses the query, creates an optimal distributed plan, sends subqueries to the Workers, and supervises the processing
C: Workers execute their subqueries (locally or via distributed communication)
D: The Queen aggregates Worker results and returns the final result to the user
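To make steps B and D concrete, here is a minimal sketch of how a global aggregate could be decomposed into per-Worker subqueries plus a final merge. The clicks table, its columns, and the partition names are hypothetical illustrations; this is not the actual plan format nCluster produces.

-- Query submitted to the Queen (hypothetical clickstream table):
SELECT user_id, count(*) AS clicks
FROM clicks
GROUP BY user_id;

-- Subquery each Worker runs against its local PostgreSQL partition:
SELECT user_id, count(*) AS partial_clicks
FROM clicks_local_partition
GROUP BY user_id;

-- Final merge of the collected partial results on the Queen:
SELECT user_id, sum(partial_clicks) AS clicks
FROM gathered_partial_results
GROUP BY user_id;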


Table Compression Architecture


Query

[Diagram: compressed tables within the logical database (e.g. a schema) are stored across physical tablespaces holding compressed blocks]

How It Works
Data is compressed / decompressed below the logical database
Stored in different tablespaces depending on compression level

Architecture Benefits
High compression ratios (bigger block sizes yield more cost savings)
Database transparent (ease of future Postgres upgrades)
Concurrency (high-performance multi-table compression)
Performance: "few-row" queries don't need full table decompression
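A minimal sketch of what "stored in different tablespaces depending on compression level" could look like from PostgreSQL's point of view, assuming each tablespace directory sits on a storage volume compressed at the OS level. The paths, tablespace names, and the clicks_2007 table are hypothetical; the compression mechanism itself lives outside PostgreSQL.

CREATE TABLESPACE compress_high   LOCATION '/data/ts_high';    -- heavily compressed volume
CREATE TABLESPACE compress_medium LOCATION '/data/ts_medium';  -- moderately compressed volume
CREATE TABLESPACE uncompressed    LOCATION '/data/ts_none';    -- plain volume

-- PostgreSQL only sees ordinary tablespaces; queries are unchanged:
CREATE TABLE clicks_2007 (user_id bigint, ts timestamptz, url text) TABLESPACE compress_high;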


Table Compression Enables Powerful Archival

Compression tiers, oldest to newest: Compressed - High, Compressed - Medium, Compressed - Low, Not Compressed

Older data accessed less frequently
Compress to save space and cost
Oldest data is compressed the most, recent is compressed the least
Compressed tables are fully available for queries (true online archival)
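Reusing the hypothetical tablespaces from the previous slide, the aging policy above could be expressed with standard PostgreSQL commands. This is a sketch under those assumptions, not Aster's actual archival mechanism.

ALTER TABLE clicks_2007 SET TABLESPACE compress_high;    -- oldest data, most compression
ALTER TABLE clicks_2008 SET TABLESPACE compress_medium;  -- older data, medium compression
-- current data (e.g. a clicks_2009 table) stays on the uncompressed tablespace

SELECT count(*) FROM clicks_2007;   -- compressed tables stay available for queries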


What is MapReduce and Why Should I Care?


What is MapReduce?
Popularized by Google
http://labs.google.com/papers/mapreduce.html

Processes data in parallel across distributed cluster

Why is MapReduce significant?


Empowers ordinary developers
Write application logic, not debug cluster communication code

Why is In-Database MapReduce significant?


Unites MapReduce with SQL: power invoked from SQL
Develop SQL/MR functions with common languages

Aster In-Database MapReduce


Users, analysts, applications

SQL/MR Functions

In-Database MapReduce

Data Store Engine Aster nCluster



In-Database MapReduce
Extensible framework (MapReduce + SQL)
Flexible: Map-Reduce expressiveness, languages, polymorphism
Performance: Massively parallel, computational push-down
Availability: Fault isolation, resource management

Out-of-process executables
Does not use PL/* for custom code execution
Can execute Map and Reduce functions in any language that has a runtime on Linux (e.g. Java, Python, C#, C/C++, Ruby, Perl, R, etc)
Standard PostgreSQL APIs to send/receive data to executables
Fault isolation, security and resource management for arbitrary user code
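As an illustration of "power invoked from SQL", a SQL/MR call follows the SELECT ... FROM function(ON ...) shape. The sessionize function, its TIMEOUT argument, and the clickstream table below are hypothetical examples, not taken from the slides.

SELECT user_id, session_id, count(*) AS clicks_in_session
FROM sessionize(
       ON clickstream
       PARTITION BY user_id
       ORDER BY ts
       TIMEOUT('60')   -- hypothetical argument: session gap in seconds
     )
GROUP BY user_id, session_id;

The function body itself would be an out-of-process executable (Java, Python, C#, ...) that each Worker streams rows to and from, as described above.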


Always On: Minimize Planned Downtime

Rebalance Data
Live Queries

Add Capacity

Data Backup

Load & Export

Backup & Restore


Precision Scaling


When more CPU/memory/capacity is needed, new nodes can be added for scale-out. Precision Scaling uses standard PostgreSQL APIs to migrate vWorker partitions to new nodes, either for load balancing (more compute power) or for capacity expansion.
Example: Assume Workers 1/2/3 are 100% CPU-bottlenecked. Incorporation adds a new Worker 4 node and migrates over vWorker partitions D/H/L. With the load spread over four nodes instead of three, CPU utilization drops to 75% per node, eliminating hotspots.


Replication
[Diagram: shared-nothing replication of vWorker partitions, treating Postgres and Linux as a service]


Fault Tolerance & Automatic Online Failover


[Diagram: Queen with Workers 1-5; each vWorker partition is replicated on a different Worker node, so queries keep running when a Worker fails]

Replication Failover
Automatic, non-disruptive, graceful performance impact

Replication Restoration
Delta-based (fast) and online (non-disruptive)


Using Commodity Hardware



Scaling On-Demand to a PetaByte

Commodity Hardware: 2 TB Building Block
Dell, HP, IBM, Intel x86
16 GB Memory, 2.4 TB of Storage (8 Disks)
$5k to $10k per Node

More Blocks = More Power

Massive Power Per Rack: 160 Cores, 640 GB RAM, 48 TB SAS


Heterogeneous Hardware Support


 

Heterogeneous HW support enables customers to add cheaper/faster nodes to existing systems for investment protection
Mix-and-match different servers as you grow (faster CPUs, more memory, bigger disk drives, different vendors, etc)


Topics
Introduction to frontline data warehouses
PetaByte warehouse design with PostgreSQL
Aster contributions to PostgreSQL
Q&A


Error Logging in COPY

Separate good from malformed tuples
Per-tuple error information
Errors to capture:
Type mismatch (e.g. text vs. int)
Check constraints (e.g. int < 0)
Malformed chars (e.g. invalid UTF-8 seq.)
Missing / extra column
Low performance overhead
Activated on-demand using environment variables
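A possible usage sketch, assuming error logging is toggled per session in the style of the tuple_routing settings shown later in this deck; the variable and table names below are hypothetical, not the exact ones in the patch.

set error_logging = true;
set error_logging_table_name = 'copy_errors';

COPY clicks FROM 'clicks.txt';   -- malformed tuples are diverted to copy_errors, good tuples load

SELECT * FROM copy_errors;       -- per-tuple error context plus the rejected line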


Error Logging in COPY

Detailed error context is logged along with tuple content


Error Logging Performance


1 million tuples COPY performance


Auto-partitioning in COPY
COPY into a parent table routes tuples directly into the child table with matching constraints

COPY y2008 FROM 'data.txt';
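A minimal sketch of the setup this example assumes: y2008 is a parent table with one child per month, each carrying a CHECK constraint that tuple routing can match. The column names and ranges are illustrative.

CREATE TABLE y2008 (ts date, amount numeric);
CREATE TABLE y2008m01 (CHECK (ts >= '2008-01-01' AND ts < '2008-02-01')) INHERITS (y2008);
CREATE TABLE y2008m02 (CHECK (ts >= '2008-02-01' AND ts < '2008-03-01')) INHERITS (y2008);
-- ... one child table per remaining month ...
-- With tuple routing enabled, the COPY above inserts each row directly into the
-- child whose CHECK constraint it satisfies, instead of into the parent.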


Auto-partitioning in loading
Activated on-demand
set tuple_routing_in_copy = 0/1;

Configurable LRU cache size


set tuple_routing_cache_size = 3;

COPY performance into the parent table is similar to a direct COPY into the child table if data is sorted
Will leverage partitioning information in the future (WIP for 8.5)


Other contributions
Temporary tables and 2PC transactions
[Auto-]Partitioning infrastructure
Regression test suite
LFI (http://lfi.sourceforge.net/)
Fault injection at the library level or below: out-of-memory conditions, network connection errors, interrupted system calls, data corruption, hardware failures, etc.
Lightning talk on Friday!


Topics
Introduction to Aster and data warehousing
PetaByte warehouse design with PostgreSQL
Aster contributions to PostgreSQL
Q&A


PetaByte Warehouses with PostgreSQL


Unmodified PostgreSQL
Always Parallel MPP architecture
Always On on-line operations
In-Database MapReduce
PetaByte on commodity Hardware


Aster Data Systems


Learn more
www.asterdata.com
Free TDWI report on advanced analytics: asterdata.com/mapreduce
Free Gartner webcast on mission-critical DW: asterdata.com/gartner

Contact us
hello@asterdata.com


Bonus slides


nCluster Components


Aster Loader

Load Balancing
CHALLENGE: Large data sets create a data transfer performance bottleneck.
SOLUTION: Multiple Loaders are mapped to vWorkers on Worker nodes. After partitioning, Loaders load unique data into vWorkers in parallel.
BENEFIT: Linearly scalable load performance from Loaders to Workers.

Scalable Partitioning
CHALLENGE: Partitioning must be scalable and not impede the loading process.
SOLUTION: Each Loader contains a Partitioner which uses an algorithm to assign data into buckets. Each bucket is uniquely mapped within the cluster.
BENEFIT: Fast, intelligent partitioning during massive-scale data loads.

Fault Tolerance
CHALLENGE: Failed nodes may lose data or significantly drop load performance.
SOLUTION: If a node fails, other streams continue to load with minimum performance hit. Admins can easily recover lost data by reloading bulk feeder data.
BENEFIT: Data loss protection and performance consistency during node failure.

