
Really Big Elephants

Data Warehousing with PostgreSQL

Josh Berkus, MySQL User Conference 2011

Included/Excluded
I will cover:

advantages of Postgres for DW
configuration
tablespaces
ETL/ELT
windowing
partitioning
materialized views

I won't cover:

hardware selection
EAV / blobs
denormalization
DW query tuning
external DW tools
backups & upgrades

What is a data warehouse?

synonyms etc.

Business Intelligence
    also BI/DW
Analytics database
OnLine Analytical Processing (OLAP)
Data Mining
Decision Support

OLTP vs DW

OLTP                                  DW
many single-row writes                few large batch imports
current data                          years of data
queries generated by user activity    queries generated by large reports
< 1s response times                   queries can run for hours
0.5 to 5x RAM                         5x to 2000x RAM
100 to 1000 users                     1 to 10 users
constraints                           no constraints

Why use PostgreSQL for data warehousing?

Complex Queries

    SELECT
      CASE WHEN ((SUM(inventory.closed_on_hand) + SUM(changes.received)
                  + SUM(changes.adjustments)
                  + SUM(changes.transferred_in-changes.transferred_out)) <> 0)
           THEN ROUND((CAST(SUM(changes.sold_and_closed + changes.returned_and_closed) AS numeric) * 100)
                      / CAST(SUM(starting.closed_on_hand) + SUM(changes.received)
                             + SUM(changes.adjustments)
                             + SUM(changes.transferred_in-changes.transferred_out) AS numeric), 5)
           ELSE 0 END AS "Percent_Sold",
      CASE WHEN (SUM(changes.sold_and_closed) <> 0)
           THEN ROUND(100*((SUM(changes.closed_markdown_units_sold)*1.0)
                           / SUM(changes.sold_and_closed)), 5)
           ELSE 0 END AS "Percent_of_Units_Sold_with_Markdown",
      CASE WHEN (SUM(changes.sold_and_closed * _sku.retail_price) <> 0)
           THEN ROUND(100*(SUM(changes.closed_markdown_dollars_sold)*1.0)
                      / SUM(changes.sold_and_closed * _sku.retail_price), 5)
           ELSE 0 END AS "Markdown_Percent",
      '0' AS "Percent_of_Total_Sales",
      CASE WHEN SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price) IS NULL
           THEN 0
           ELSE SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price)
           END AS "Net_Sales_at_Retail",
      '0' AS "Percent_of_Ending_Inventory_at_Retail",
      SUM(inventory.closed_on_hand * _sku.retail_price) AS "Ending_Inventory_at_Retail",
      "_store"."label" AS "Store",
      "_department"."label" AS "Department",
      "_vendor"."name" AS "Vendor_Name"
    FROM inventory
      JOIN inventory as starting
        ON inventory.warehouse_id = starting.warehouse_id
       AND inventory.sku_id = starting.sku_id
      LEFT OUTER JOIN (
        SELECT warehouse_id, sku_id,
               sum(received) as received,
               sum(transferred_in) as transferred_in,
               sum(transferred_out) as transferred_out,
               sum(adjustments) as adjustments,
               sum(sold) as sold
        FROM movement
        WHERE movement.movement_date BETWEEN '2010-08-05' AND '2010-08-19'
        GROUP BY sku_id, warehouse_id ) as changes
        ON inventory.warehouse_id = changes.warehouse_id
       AND inventory.sku_id = changes.sku_id
      JOIN _sku ON _sku.id = inventory.sku_id
      JOIN _warehouse ON _warehouse.id = inventory.warehouse_id
      JOIN _location_hierarchy AS _store
        ON _store.id = _warehouse.store_id AND _store.type = 'Store'
      JOIN _product ON _product.id = _sku.product_id
      JOIN _merchandise_hierarchy AS _department

Complex Queries

JOIN optimization
    5 different JOIN types
    approximate planning for 20+ table joins
subqueries in any clause
    plus nested subqueries
windowing queries
recursive queries

Big Data Features

big tables       partitioning
big databases    tablespaces
big backups      PITR
big updates      binary replication
big queries      resource control

Extensibility

add data analysis functionality from external libraries inside the database
    financial analysis
    genetic sequencing
    approximate queries
create your own:
    data types
    aggregates
    functions
    operators
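As a minimal sketch of a user-defined aggregate: the statement below builds a product() aggregate on top of the built-in numeric_mul function. The aggregate name and the sample values are illustrative and do not come from the talk.

```sql
-- Hypothetical example: a product() aggregate, which PostgreSQL does not ship.
-- numeric_mul is the built-in function behind the numeric * operator.
CREATE AGGREGATE product(numeric) (
    SFUNC    = numeric_mul,   -- state transition: running_product * next_value
    STYPE    = numeric,       -- type of the running state
    INITCOND = '1'            -- start from the multiplicative identity
);

-- Usage: compound three monthly growth factors (1.10 * 0.95 * 1.20) into one.
SELECT product(growth)
FROM (VALUES (1.10), (0.95), (1.20)) AS monthly(growth);
```

Once defined, the aggregate works anywhere a built-in one does, including GROUP BY queries and window clauses.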


Community

"I'm running a partitioning scheme using 256 tables with a maximum of 16 million rows (namely IPv4-addresses) and a current total of about 2.5 billion rows, there are no deletes though, but lots of updates."

"I use PostgreSQL basically as a data warehouse to store all the genetic data that our lab generates … With this configuration I figure I'll have ~3TB for my main data tables and 1TB for indexes."

lots of experience with large databases
blogs, tools, online help

Sweet Spot
[Chart comparing MySQL, PostgreSQL, and DW Database; axis from 0 to 30]

DW Databases

Vertica
Greenplum
Aster Data
Infobright
Teradata
Hadoop/HBase
Netezza
HadoopDB
LucidDB
MonetDB
SciDB
Paraccel


How do I configure PostgreSQL for data warehousing?

General Setup

Latest version of PostgreSQL
System with lots of drives
    6 to 48 drives
    or 2 to 12 SSDs
High-throughput RAID
Write-ahead log (WAL) on separate disk(s)
    10 to 50 GB space

separate the DW workload onto its own server

Settings
few connections

    max_connections = 10 to 40

raise those memory limits!

    shared_buffers = 1/8 to 1/4 of RAM
    work_mem = 128MB to 1GB
    maintenance_work_mem = 512MB to 1GB
    temp_buffers = 128MB to 1GB
    effective_cache_size = 3/4 of RAM
    wal_buffers = 16MB
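To make the ranges concrete, here is one way they could be applied, assuming a dedicated DW server with 64GB of RAM (an assumed figure, not from the talk). The sketch uses ALTER SYSTEM, which needs PostgreSQL 9.4 or later; on older versions the same values would go into postgresql.conf.

```sql
-- Assumed: dedicated DW server with 64GB of RAM; values picked from the ranges above.
ALTER SYSTEM SET max_connections = 40;
ALTER SYSTEM SET shared_buffers = '16GB';          -- 1/4 of 64GB
ALTER SYSTEM SET work_mem = '512MB';               -- per sort/hash operation, per query
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM SET temp_buffers = '256MB';
ALTER SYSTEM SET effective_cache_size = '48GB';    -- 3/4 of 64GB
ALTER SYSTEM SET wal_buffers = '16MB';

-- max_connections, shared_buffers and wal_buffers only change after a server restart;
-- the others are picked up after a reload:
SELECT pg_reload_conf();
```

Settings such as work_mem can also be raised for a single session with SET, which is often enough for one large report.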

No autovacuum

    autovacuum = off
    vacuum_cost_delay = 0

do your VACUUMs and ANALYZEs as part of the batch load process
    usually several of them
also maintain tables by partitioning
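With autovacuum off, the load job has to refresh statistics itself. A rough sketch of one load step follows; sales_import and sales_2011_06 are stand-in table names.

```sql
-- Hypothetical load step; sales_import and sales_2011_06 stand in for your own tables.
BEGIN;
INSERT INTO sales_2011_06
SELECT * FROM sales_import;
COMMIT;

-- VACUUM cannot run inside a transaction block, so it follows the COMMIT.
VACUUM ANALYZE sales_2011_06;
```

A longer batch process would typically repeat the ANALYZE after each major step, so later steps are planned with current statistics.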

What are tablespaces?

logical data extents

lets you put some of your data on specific devices / disks

    CREATE TABLESPACE history_log LOCATION '/mnt/san2/history_log';
    ALTER TABLE history_log SET TABLESPACE history_log;

tablespace reasons

parallelize access
    your largest "fact table" on one tablespace
    its indexes on another
    not as useful if you have a good SAN
temp tablespace for temp tables
move key join tables to SSD
migrate to new storage one table at a time
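The reasons above could translate into statements like the following. This is a sketch only: every mount point, tablespace name and the seller table are assumptions made for illustration.

```sql
-- Hypothetical mount points and object names; the directories must already exist
-- and be owned by the postgres operating-system user.
CREATE TABLESPACE fact_space  LOCATION '/mnt/array1/pg_fact';
CREATE TABLESPACE index_space LOCATION '/mnt/array2/pg_index';
CREATE TABLESPACE temp_space  LOCATION '/mnt/scratch/pg_temp';
CREATE TABLESPACE ssd_space   LOCATION '/mnt/ssd/pg_ssd';

-- Fact table on one tablespace, its index on another.
ALTER TABLE sales SET TABLESPACE fact_space;
CREATE INDEX sales_sell_date_idx ON sales (sell_date) TABLESPACE index_space;

-- Temp tables and on-disk sorts go to the scratch tablespace
-- (session-level here; it can also be set in postgresql.conf).
SET temp_tablespaces = 'temp_space';

-- Move a key join table to SSD; this rewrites the table under an exclusive lock.
ALTER TABLE seller SET TABLESPACE ssd_space;
```

Because ALTER TABLE ... SET TABLESPACE moves one table per statement, a storage migration can be spread over several maintenance windows.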

What is ETL and how do I do it?

Extract, Transform, Load

how you turn external raw data into normalized database data
    Apache logs → web analytics DB
    CSV POS files → financial reporting DB
    OLTP server → 10-year data warehouse
also called ELT when the transformation is done inside the database
    PostgreSQL is particularly good for ELT

L: INSERT

batch INSERTs into 100's or 1000's per transaction
    row-at-a-time is very slow
create and load import tables in one transaction
add indexes and constraints after load
insert several streams in parallel
    but not more than CPU cores
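A small sketch of that loading pattern, using a made-up weblog_import table and three sample rows in place of a real batch:

```sql
BEGIN;

-- Created and loaded in the same transaction, with no indexes or constraints yet.
CREATE TABLE weblog_import (
    hit_time  TIMESTAMPTZ,
    page      TEXT,
    ip        INET
);

-- One multi-row INSERT per batch; a real batch would hold hundreds or thousands of rows.
INSERT INTO weblog_import (hit_time, page, ip) VALUES
    ('2011-04-12 09:00:01', '/index.html',   '10.0.0.1'),
    ('2011-04-12 09:00:02', '/pricing.html', '10.0.0.2'),
    ('2011-04-12 09:00:02', '/index.html',   '10.0.0.3');

COMMIT;

-- Indexes and constraints are added once, after the data is in.
CREATE INDEX weblog_import_hit_time_idx ON weblog_import (hit_time);
ALTER TABLE weblog_import ALTER COLUMN hit_time SET NOT NULL;
```

Building the index after the load means it is built once over the full data set instead of being updated row by row.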

L: COPY

Powerful, efficient delimited file loader
    almost bug-free - we use it for backup
    3-5X faster than inserts
    works with most delimited files
Not fault-tolerant
    also have to know structure in advance
try pgloader for better COPY

L: COPY

    COPY weblog_new FROM '/mnt/transfers/weblogs/weblog-20110605.csv' with csv;

    COPY traffic_snapshot FROM 'traffic_20110605192241' delimiter '|' null as 'N';

    \copy weblog_summary_june TO 'Desktop/weblog-june2011.csv' with csv header

L: in 9.1: FDW

    CREATE FOREIGN TABLE raw_hits (
        hit_time TIMESTAMP,
        page TEXT
    )
    SERVER file_fdw
    OPTIONS (format 'csv', delimiter ';', filename '/var/log/hits.log');
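The foreign table definition presumes that a foreign server named file_fdw already exists. A minimal setup sketch, run once per database before creating the foreign table:

```sql
-- file_fdw ships in contrib; the server name here matches the SERVER clause above.
CREATE EXTENSION file_fdw;
CREATE SERVER file_fdw FOREIGN DATA WRAPPER file_fdw;
```

After that, the log file can be queried like any other table, and each query re-reads the file from disk.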

L: in 9.1: FDW

    CREATE TABLE hits_2011041617 AS
    SELECT page, count(*)
    FROM raw_hits
    WHERE hit_time > '2011-04-16 16:00:00'
      AND hit_time <= '2011-04-16 17:00:00'
    GROUP BY page;

T: temporary tables

    CREATE TEMPORARY TABLE sales_records_june_rollup ON COMMIT DROP AS
    SELECT seller_id, location, sell_date, sum(sale_amount), array_agg(item_id)
    FROM raw_sales
    WHERE sell_date BETWEEN '2011-06-01' AND '2011-06-30 23:59:59.999'
    GROUP BY seller_id, location, sell_date;

in 9.1: unlogged tables

like MyISAM without the risk

    CREATE UNLOGGED TABLE cleaned_log_import AS
    SELECT hit_time, page
    FROM raw_hits, hit_watermark
    WHERE hit_time > last_watermark
      AND is_valid(page);

T: stored procedures

multiple languages
    SQL
    PL/pgSQL
    PL/Perl
    PL/Python
    PL/PHP
    PL/R
    PL/Java
    allows you to use external data processing libraries in the database
custom aggregates, operators, more

    CREATE OR REPLACE FUNCTION normalize_query ( queryin text )
    RETURNS TEXT LANGUAGE PLPERL STABLE STRICT AS $f$
    # this function "normalizes" queries by stripping out constants
    # some regexes by Guillaume Smet under The PostgreSQL License
    local $_ = $_[0];
    # first cleanup the whitespace
    s/\s+/ /g;
    s/\s,/,/g;
    s/,(\S)/, $1/g;
    s/^\s//g;
    s/\s$//g;
    # remove any double quotes and quoted text
    s/\\'//g;
    s/'[^']*'/''/g;
    s/''('')+/''/g;
    # remove TRUE and FALSE
    s/(\W)TRUE(\W)/$1BOOL$2/gi;
    s/(\W)FALSE(\W)/$1BOOL$2/gi;
    # remove any bare numbers or hex numbers
    s/([^a-zA-Z_\$-])-?([0-9]+)/${1}0/g;
    s/([^a-z_\$-])0x[0-9a-f]{1,10}/${1}0x/ig;
    # normalize any IN statements
    s/(IN\s*)\([\'0x,\s]*\)/${1}(...)/ig;
    # return the normalized query
    return $_;
    $f$;

    CREATE OR REPLACE FUNCTION f_graph2() RETURNS text AS '
    sql <- paste("SELECT id as x,hit as y FROM mytemp LIMIT 30",sep="");
    str <- c(pg.spi.exec(sql));
    mymain <- "Graph 2";
    mysub <- paste("The worst offender is: ",str[1,3]," with ",str[1,2]," hits",sep="");
    myxlab <- "Top 30 IP Addresses";
    myylab <- "Number of Hits";
    pdf(''/tmp/graph2.pdf'');
    plot(str,type="b",main=mymain,sub=mysub,xlab=myxlab,ylab=myylab,lwd=3);
    mtext("Probes by intrusive IP Addresses",side=3);
    dev.off();
    print(''DONE'');
    ' LANGUAGE plr;

ELT Tips

bulk insert into a new table instead of updating/deleting an existing table
update all columns in one operation instead of one at a time
use views and custom functions to simplify your queries
inserting into your long-term tables should be the very last step - no updates after!
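A sketch of the first, second and last tip together. raw_sales echoes the import table from the temporary-table slide; clean_sales, sales_history and the specific clean-up expressions are assumptions for illustration.

```sql
-- Hypothetical names: raw_sales is the import table, clean_sales the result.
-- All column transformations happen in one pass that writes a brand-new table,
-- instead of several UPDATE statements against raw_sales.
CREATE TABLE clean_sales AS
SELECT seller_id,
       upper(trim(location))   AS location,
       sell_date::date         AS sell_day,
       round(sale_amount, 2)   AS sale_amount
FROM raw_sales
WHERE sale_amount IS NOT NULL;

-- Very last step: append to the long-term table, then never update it.
-- sales_history is assumed to have the same four columns in the same order.
INSERT INTO sales_history
SELECT * FROM clean_sales;
```

Writing a new table avoids the dead rows that UPDATE and DELETE leave behind, which matters when autovacuum is switched off.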

What's a windowing query?

[Diagram: regular aggregate vs. windowing function]

    CREATE TABLE events (
        event_id INT,
        event_type TEXT,
        start TIMESTAMPTZ,
        duration INTERVAL,
        event_desc TEXT
    );

    SELECT MAX(concurrent)
    FROM (
        SELECT SUM(tally) OVER (ORDER BY start) AS concurrent
        FROM (
            SELECT start, 1::INT as tally FROM events
            UNION ALL
            SELECT (start + duration), -1 FROM events
        ) AS event_vert
    ) AS ec;
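To see what the inner window function produces, here is the same technique on four made-up events supplied inline, so it runs without any tables. The sample timestamps and the at_time alias are illustrative.

```sql
-- Four made-up events standing in for the events table.
WITH events(start, duration) AS (
    VALUES (TIMESTAMPTZ '2011-04-12 09:00', INTERVAL '2 hours'),
           (TIMESTAMPTZ '2011-04-12 10:00', INTERVAL '2 hours'),
           (TIMESTAMPTZ '2011-04-12 10:30', INTERVAL '30 minutes'),
           (TIMESTAMPTZ '2011-04-12 13:00', INTERVAL '1 hour')
),
event_vert AS (
    SELECT start AS at_time, 1 AS tally FROM events      -- +1 when an event starts
    UNION ALL
    SELECT start + duration, -1 FROM events              -- -1 when it ends
)
SELECT at_time,
       SUM(tally) OVER (ORDER BY at_time) AS concurrent  -- running total = events in progress
FROM event_vert
ORDER BY at_time;
```

Unlike a regular aggregate, the window function keeps one output row per input row. Here the running total climbs to 3 at 10:30, when the first three events overlap, and that peak is what the outer MAX() picks out.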

    UPDATE partition_name SET drop_month = dropit
    FROM (
        SELECT round_id,
               CASE WHEN ( ( row_number() over
                             (partition by team_id order by team_id, total_points) )
                           <= ( drop_lowest ) )
                    THEN 0 ELSE 1 END as dropit
        FROM (
            SELECT team.team_id, round.round_id, month_points as total_points,
                   row_number() OVER (
                       partition by team.team_id, kal.positions
                       order by team.team_id, kal.positions, month_points desc
                   ) as ordinal,
                   at_least, numdrop as drop_lowest
            FROM partition_name as rdrop
                JOIN round USING (round_id)
                JOIN team USING (team_id)
                JOIN pick ON round.round_id = pick.round_id
                    and pick.pick_period @> this_period
                LEFT OUTER JOIN keep_at_least kal
                    ON rdrop.pool_id = kal.pool_id
                    and pick.position_id = any ( kal.positions )
            WHERE rdrop.pool_id = this_pool
              AND team.team_id = this_team
        ) as ranking
        WHERE ordinal > at_least or at_least is null
    ) as droplow
    WHERE droplow.round_id = partition_name.round_id
      AND partition_name.pool_id = this_pool
      AND dropit = 0;

    SELECT round_id,
           CASE WHEN ( ( row_number() OVER
                         (partition by team_id order by team_id, total_points) )
                       <= ( drop_lowest ) )
                THEN 0 ELSE 1 END as dropit
    FROM (
        SELECT team.team_id, round.round_id, month_points as total_points,
               row_number() OVER (
                   partition by team.team_id, kal.positions
                   order by team.team_id, kal.positions, month_points desc
               ) as ordinal

stream processing SQL

replace multiple queries with a single query
    avoid scanning large tables multiple times
replace pages of application code
    and MB of data transmission
SQL alternative to map/reduce
    (for some data mining tasks)

How do I partition my tables?

Postgres partitioning

based on table inheritance and constraint exclusion
    partitions are also full tables
    explicit constraints define the range of the partition
    triggers or RULEs handle insert/update

    CREATE TABLE sales (
        sell_date TIMESTAMPTZ NOT NULL,
        seller_id INT NOT NULL,
        item_id INT NOT NULL,
        sale_amount NUMERIC NOT NULL,
        narrative TEXT
    );

    CREATE TABLE sales_2011_06 (
        CONSTRAINT partition_date_range
        CHECK (sell_date >= '2011-06-01' AND sell_date < '2011-07-01' )
    ) INHERITS ( sales );

    CREATE FUNCTION sales_insert ()
    RETURNS trigger
    LANGUAGE plpgsql AS $f$
    BEGIN
        CASE
            WHEN NEW.sell_date < '2011-06-01' THEN
                INSERT INTO sales_2011_05 VALUES (NEW.*);
            WHEN NEW.sell_date < '2011-07-01' THEN
                INSERT INTO sales_2011_06 VALUES (NEW.*);
            WHEN NEW.sell_date >= '2011-07-01' THEN
                INSERT INTO sales_2011_07 VALUES (NEW.*);
            ELSE
                INSERT INTO sales_overflow VALUES (NEW.*);
        END CASE;
        RETURN NULL;
    END;
    $f$;

    CREATE TRIGGER sales_insert BEFORE INSERT ON sales
    FOR EACH ROW EXECUTE PROCEDURE sales_insert();
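One way to check that constraint exclusion is working is to EXPLAIN a query that filters on the partition key. This sketch reuses the sales tables above; the date range is arbitrary.

```sql
-- 'partition' has been the default since 8.4: exclusion is applied to inheritance children.
SET constraint_exclusion = partition;

-- The plan should scan the parent table and only those children whose CHECK
-- constraints can overlap the requested week (here, sales_2011_06), plus any
-- child that has no CHECK constraint at all.
EXPLAIN
SELECT sum(sale_amount)
FROM sales
WHERE sell_date >= '2011-06-10'
  AND sell_date <  '2011-06-17';
```

A query with no condition on sell_date gets no such pruning and scans every partition, which is why queries that do not use the partition key perform poorly on partitioned tables.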

Postgres partitioning

Good for:                               Bad for:
"rolling off" data                      administration
DB maintenance                          queries which do not use the partition key
queries which use the partition key     JOINs
under 300 partitions                    over 300 partitions
insert performance                      update performance

you need a data expiration policy

you can't plan your DW otherwise
    sets your storage requirements
    lets you project how queries will run when database is "full"
people don't like talking about deleting data
    will take a lot of meetings

you need a data expiration policy

raw import data              1 month
detail-level transactions    3 years
detail-level web logs        1 year
rollups                      10 years
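Enforcing such a policy might look like the sketch below. The partition sales_2008_06, the raw_import table and its loaded_at column are assumed names.

```sql
-- Partitioned detail data: detach the expired month, then drop it.
-- sales_2008_06 is a hypothetical partition that has aged past the 3-year policy.
ALTER TABLE sales_2008_06 NO INHERIT sales;
DROP TABLE sales_2008_06;

-- Unpartitioned raw import data: delete by age (1 month in the policy above).
-- raw_import and loaded_at are assumed names.
DELETE FROM raw_import
WHERE loaded_at < now() - INTERVAL '1 month';
```

Dropping a partition is a quick catalog operation, while the DELETE leaves dead rows that a later VACUUM has to clean up. That difference is the practical argument for partitioning by the expiration key.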

What's a materialized view?

query results as table

calculate once, read many times
    complex/expensive queries
    frequently referenced
not necessarily a whole query
    often part of a query
manually maintained in PostgreSQL
    automagic support not complete yet

    SELECT page, COUNT(*) as total_hits
    FROM hit_counter
    WHERE date_trunc('day', hit_date)
          BETWEEN now() - INTERVAL '7 days' AND now()
    GROUP BY page
    ORDER BY total_hits DESC LIMIT 10;

    CREATE TABLE page_hits (
        page TEXT,
        hit_day DATE,
        total_hits INT,
        CONSTRAINT page_hits_pk PRIMARY KEY(hit_day, page)
    );

each day:

    INSERT INTO page_hits
    SELECT page, date_trunc('day', hit_date) as hit_day, COUNT(*) as total_hits
    FROM hit_counter
    WHERE date_trunc('day', hit_date) = date_trunc('day', now() - INTERVAL '1 day')
    GROUP BY page, date_trunc('day', hit_date)
    ORDER BY total_hits DESC;

    SELECT page, total_hits
    FROM page_hits
    WHERE hit_day BETWEEN now() - INTERVAL '7 days' AND now();

maintaining matviews

BEST: update matviews at batch load time
GOOD: update matview according to clock/calendar
BAD for DW: update matviews using a trigger

matview tips

matviews should be small
    1/10 to 1/4 of RAM
each matview should support several queries
    or one really really important one
truncate + insert, don't update
index matviews like crazy
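A sketch of the truncate + insert refresh, using a hypothetical 7-day summary table built from hit_counter. The table, index and column choices are illustrative.

```sql
-- Hypothetical 7-day matview built from the hit_counter table.
CREATE TABLE page_hits_7day (
    page        TEXT PRIMARY KEY,
    total_hits  BIGINT NOT NULL
);
CREATE INDEX page_hits_7day_hits_idx ON page_hits_7day (total_hits DESC);

-- Refresh: empty and refill in one transaction. TRUNCATE holds an exclusive lock
-- until COMMIT, so readers wait instead of seeing a half-built table.
BEGIN;
TRUNCATE page_hits_7day;
INSERT INTO page_hits_7day (page, total_hits)
SELECT page, count(*)
FROM hit_counter
WHERE hit_date >= now() - INTERVAL '7 days'
  AND page IS NOT NULL
GROUP BY page;
COMMIT;

ANALYZE page_hits_7day;
```

Refilling from scratch leaves no dead rows behind, which is the main reason to prefer it over UPDATE when the table is small enough to rebuild at every batch load.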

Contact

Josh Berkus: josh@pgexperts.com
    blog: blogs.ittoolbox.com/database/soup
PostgreSQL: www.postgresql.org
    pgexperts: www.pgexperts.com

Upcoming Events
    pgCon: Ottawa: May 17-20
    OpenSourceBridge: Portland: June

This talk is copyright 2010 Josh Berkus and is licensed under the creative commons attribution license. Special thanks for materials to: Elein Mustain (PL/R), Hitoshi Harada and David Fetter (windowing functions), Andrew Dunstan (file_FDW)
