
Data Warehousing

Jens Teubner, TU Dortmund


jens.teubner@cs.tu-dortmund.de

Winter 2014/15

Jens Teubner Data Warehousing Winter 2014/15 1


Part VII

MapReduce et al.

Jens Teubner Data Warehousing Winter 2014/15 180


Scaling Up Data Warehouse Systems

Growing expectations toward Data Warehouses:


increasing data volumes (Big Data)
increasing complexity of analyses

Problems:
OLAP queries are multi-dimensional queries
Curse of Dimensionality: indexes become ineffective
Indexes can't help to fight growing query complexity.
→ Workloads become scan-heavy.
Scaling up a server becomes expensive

Jens Teubner Data Warehousing Winter 2014/15 181


Curse of Dimensionality

[Figure: elapsed time for NN search (ms, log scale from 100 to 100,000) over the
number of dimensions in the vectors (0 to 45); N = 50,000, image database, k = 10.
Methods compared: sequential Scan, R*-tree, X-tree, VA-File.]

Jens Teubner Data Warehousing Winter 2014/15 182


Parallel Query Evaluation

Scans can be parallelized, however:

[Figure: a user's scan query (zip = 44227) is split into identical sub-scans that
run in parallel, one per disk.]

parallel hardware (e.g., graphics processors)


cluster systems

Jens Teubner Data Warehousing Winter 2014/15 183


OLAP Using GPUs

E.g., Jedox OLAP Accelerator (uses NVIDIA Tesla GPUs):

[Figure: excerpt from a Jedox fact sheet on the GPU Accelerator. Source systems
(ERP, CRM, SCM, flat files, SAP BI/BW, SAP R/3, relational databases, DWHs) feed the
Jedox OLAP server. The gist of the German text: modern GPU modules consist of
thousands of processing units; the cell data of the OLAP cubes is kept entirely in
GPU memory ("In-GPU-Memory" technology), so that only queries and results travel
between CPU and GPU; this significantly speeds up consolidations, complex planning
and simulation scenarios, and what-if analyses, and multiple GPUs can be combined
for larger cubes.]

source: Jedox White Paper

Jens Teubner Data Warehousing Winter 2014/15 184

Parallel Databases

E.g., Teradata Database:

[Figure 1 from a Teradata white paper: "Traditional Parallel Database versus
Teradata Database". In the traditional design, serialized process steps (typically
sorts, aggregations, and joins) become bottlenecks; Teradata breaks all tasks into
small units of work and distributes them evenly among software processors called
Access Module Processors (AMPs), giving true query parallelism and balanced
performance from the initial query down to the final merge and result.]

source: Teradata White Paper

Jens Teubner Data Warehousing Winter 2014/15 185

Scalability Challenges

Challenges:

Robustness:
More components → higher risk of failure
Failure of single component might take whole system
off-line.
Scalability/Elasticity:
Provision for peak load?
Use resources otherwise when DW not at peak load?
Add resources later (when business grows)?
Cost:
(Reliable) large installations tend to become expensive.
(There's a relatively small market for very large systems.)

Jens Teubner Data Warehousing Winter 2014/15 186


Scalability in Web Search

Search engines faced similar challenges very early on.

Task: generate inverted files


doc1: "data warehouses are cool"
doc2: "cool guys distribute their data"

term         cnt   posting list
are           1    ⟨doc1 : 3⟩
cool          2    ⟨doc1 : 4⟩, ⟨doc2 : 1⟩
data          2    ⟨doc1 : 1⟩, ⟨doc2 : 5⟩
distribute    1    ⟨doc2 : 3⟩
guys          1    ⟨doc2 : 2⟩
their         1    ⟨doc2 : 4⟩
warehouses    1    ⟨doc1 : 2⟩

Jens Teubner Data Warehousing Winter 2014/15 187


Inverted File Generation

Idea: Break up index generation into two parts:


1 For each document, extract terms.
2 Collect terms into groups and emit an index entry per group.

E.g.,
1   foreach document doc do
2       pos ← 1;
3       tokens ← parse (doc);
4       foreach word in tokens do
5           emit ⟨word, doc.id : pos⟩;
6           pos ← pos + 1;
7   collect ⟨key, (values . . . )⟩ pairs;

8   foreach ⟨key, (values)⟩ do
9       count ← 0;
10      pList ← ( );
11      foreach v ∈ values do
12          pList.append (v);
13          count ← count + 1;
14      emit ⟨key, count, pList⟩;
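
A minimal Hadoop rendering of the same two parts, for illustration only (class names
are made up; with a plain TextInputFormat the map input key would really be a byte
offset, which this sketch abuses as a document id):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Part 1 (lines 1-6): emit one <word, "docId:pos"> pair per token.
class TermMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable docId, Text doc, Context ctx)
            throws IOException, InterruptedException {
        int pos = 1;
        for (String word : doc.toString().split("\\s+")) {   // simplistic parse()
            ctx.write(new Text(word), new Text(docId.get() + ":" + pos));
            pos++;
        }
    }
}

// Part 2 (lines 8-14): the framework has already grouped all postings of a term
// (line 7, the shuffle); emit <term, count, posting list>.
class PostingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> postings, Context ctx)
            throws IOException, InterruptedException {
        int count = 0;
        StringBuilder pList = new StringBuilder();
        for (Text p : postings) {
            if (count++ > 0) pList.append(", ");
            pList.append(p.toString());
        }
        ctx.write(term, new Text(count + "\t" + pList));
    }
}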

Jens Teubner Data Warehousing Winter 2014/15 188


Inverted File Generation

Observations: (for parallel execution)


For part 1, documents can be partitioned arbitrarily over
nodes.
For part 2 , all postings of one term must be collocated on the
same node (postings for different terms may be on different
nodes).
To establish collocation, data may have to be moved
(shuffled) across nodes.

Jens Teubner Data Warehousing Winter 2014/15 189


Distributed Index Generation

[Figure: distributed index generation. The partitioned input is fed to several
"extract terms" nodes; a shuffle step redistributes the emitted postings so that all
postings of one term meet at the same "emit index entries" node; the result is again
partitioned.]

Jens Teubner Data Warehousing Winter 2014/15 190


Generalization (→ MapReduce)

The application pattern turns out to be highly versatile.

Only replace foreach bodies:


lines 2–6:   f1 :: … → [⟨key, value⟩]    ("Mapper")
lines 8–14:  f2 :: ⟨key, [values]⟩ → …   ("Reducer")

Shuffling (line 7) combines [⟨key, value⟩] (a list of key/value pairs) into a
list of ⟨key, [values]⟩ (pairs of a key and the list of its values).
Shuffling (combining) is generic.

MapReduce³ is a framework for distributed computing, where f1 and


f2 can be instantiated by the user.

3
Dean and Ghemawat. "MapReduce: Simplified Data Processing on Large
Clusters. OSDI 2004.
Jens Teubner Data Warehousing Winter 2014/15 191
Example: Webserver Log File Analysis
E.g., Webserver log file analysis
Task: For each client IP, report total traffic (in bytes).
→ Mapper and Reducer implementations?
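
One possible answer, as a hedged sketch: it assumes Apache-style access-log lines
whose first field is the client IP and whose last field is the response size in
bytes (field positions and class names are assumptions, not part of the lecture):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: one log line in, one <clientIP, bytes> pair out.
class TrafficMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] f = line.toString().split(" ");
        String clientIp = f[0];              // assumed: IP is the first field
        String size = f[f.length - 1];       // assumed: byte count is the last field
        if (!size.equals("-"))               // some log entries carry no size
            ctx.write(new Text(clientIp), new LongWritable(Long.parseLong(size)));
    }
}

// Reducer: sum up all byte counts of one client IP.
class TrafficReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text clientIp, Iterable<LongWritable> bytes, Context ctx)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable b : bytes)
            total += b.get();
        ctx.write(clientIp, new LongWritable(total));
    }
}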

Jens Teubner Data Warehousing Winter 2014/15 192


MapReduce Illustrated

by @kerzol on Twitter
Jens Teubner Data Warehousing Winter 2014/15 193
MapReduce

The MapReduce framework


decides on a number of Mappers and Reducers to instantiate,
decides the partitioning of data and computation,
moves data as necessary and implements shuffling;

considers cluster topology, system load, etc.,


interfaces with a distributed file system (Google File System).

Apache Hadoop provides an open-source implementation of the


MapReduce concept; also comes with the Hadoop Distributed File
System, HDFS.
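
For orientation, a hedged sketch of what submitting such a job to Hadoop looks like
from the user's side; everything else (partitioning, shuffling, HDFS access) happens
inside the framework. The paths and the Mapper/Reducer class names from the earlier
log-analysis sketch are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TrafficJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "traffic per client IP");
        job.setJarByClass(TrafficJob.class);
        job.setMapperClass(TrafficMapper.class);    // from the sketch above (illustrative)
        job.setReducerClass(TrafficReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("hdfs:///logs/access.log"));  // assumed path
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///out/traffic"));    // assumed path
        System.exit(job.waitForCompletion(true) ? 0 : 1);   // framework does the rest
    }
}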

Jens Teubner Data Warehousing Winter 2014/15 194


So What?

The idea seems straightforward. Why all the fuss?

Remember the challenges we stated?


Risk of failures; elasticity; cost

MapReduce was designed for large clusters of cheap machines.


Think of thousands of machines.
Failures are frequent (and have to be dealt with).
This is why MapReduce has become popular.

Jens Teubner Data Warehousing Winter 2014/15 195


Failure Tolerance?

Trick:
Mapper and Reducer must be pure functions.
Their output depends only on their input.
No side effects.
Computation can be done anywhere, repeated if necessary.

MapReduce runtime:
Monitor job execution.
Job does not finish within expected time?
→ Restart on a different node.
Might end up processing a task unit twice → discard all
results but one.
Also used to improve performance (in case of stragglers).

Jens Teubner Data Warehousing Winter 2014/15 196


Performance: Grep

E.g., scan 10¹⁰ 100-byte words for a three-character pattern.

1800 machines (paper from 2004):
each 2 × 2 GHz
each 160 GB IDE HDD
Gigabit Ethernet

[Figure 2 from the MapReduce paper: data transfer rate over time for the grep job;
the aggregate input rate peaks at about 30 GB/s within the first minute.]

→ Leverage aggregate disk bandwidth.
→ This is what we need for OLAP, too.
Jens Teubner Data Warehousing Winter 2014/15 197
Performance: Sort

[Figure 3 from the MapReduce paper: data transfer rates over time (input, shuffle,
and output, in MB/s) for different executions of the sort program:
(a) normal execution, (b) no backup tasks, (c) 200 tasks killed.]

Jens Teubner Data Warehousing Winter 2014/15 198


MapReduce for Data Warehousing

MapReduce is not a database.


No tables, tuples, rows, schemas, indexes, etc.

Rather, MapReduce is based on files.


Typically kept in a distributed file system.

This is unfortunate:
No schema information to optimize, validate, etc.
No indexes (or other means to improve physical representation).

This is good:
Start analyzing immediately; don't wait for index creation, etc.
May ease ad-hoc analyses.

Jens Teubner Data Warehousing Winter 2014/15 199


Beyond the Basic Idea

While the original MapReduce is proprietary to Google, Hadoop is


widely used in industry and research.
Java-based
Can run on heterogeneous platforms, cloud systems, etc.
Integration with other Apache technology
Hadoop Distributed File System (HDFS), HBase, etc.
Can hook into more functions than just Mapper and Reducer
e.g., pre-aggregate between map and shuffle
modify partitioning, etc.
Many interfaces Hadoop ↔ database/data warehouse

Jens Teubner Data Warehousing Winter 2014/15 200


Hadoop and Petabyte Sort Benchmark

Challenge: sort 1 TB of 100-byte records.


Hardware:
3800 nodes, 2 × 4 × 2.5 GHz per node
4 SATA disks, 8 GB RAM per node
Results:

GBytes       Nodes    Maps      Reduces   Repl.   Time

500          1406     8000      2600      1       59 sec
1,000        1460     8000      2700      1       62 sec
100,000      3452     190,000   10,000    2       173 min
1,000,000    3658     80,000    20,000    2       975 min

Jens Teubner Data Warehousing Winter 2014/15 201


Hadoop and Petabyte Sort Benchmark

Jens Teubner Data Warehousing Winter 2014/15 202


Hadoop and Petabyte Sort Benchmark

Jens Teubner Data Warehousing Winter 2014/15 203


MapReduce ↔ Databases: Load Times

[Figures 2 and 3 from Pavlo et al., "A Comparison of Approaches to Large-Scale Data
Analysis", SIGMOD 2009: load times for the Grep task data set (1 TB/cluster) and a
20 GB/node data set on Vertica and Hadoop; the Hadoop bars (labeled 75.5 s, 67.7 s,
and 2262.2 s) stay far below those of the database systems.]

→ Schema and physical data organization make loading slower on the databases.

Jens Teubner Data Warehousing Winter 2014/15 204

MapReduce ↔ Databases: Grep Benchmark

[Figure 5 from Pavlo et al., "A Comparison of Approaches to Large-Scale Data
Analysis", SIGMOD 2009: Grep task results for the 1 TB/cluster data set on 25, 50,
and 100 nodes, Vertica versus Hadoop; runtimes up to roughly 1500 seconds.]

MapReduce leaves the result as a collection of files; collecting it into a
single result costs additional time.

Jens Teubner Data Warehousing Winter 2014/15 205

MapReduce ↔ Databases: Aggregation

[Figures 7 and 8 from Pavlo et al., "A Comparison of Approaches to Large-Scale Data
Analysis", SIGMOD 2009: aggregation task results with 2.5 million groups and with
2,000 groups, for 1 to 100 nodes, on Vertica and Hadoop.]

Databases limited by communication cost, which is lower for
smaller group counts.

Jens Teubner Data Warehousing Winter 2014/15 206

MapReduce ↔ Databases: Join

[Figure 9 from Pavlo et al., "A Comparison of Approaches to Large-Scale Data
Analysis", SIGMOD 2009: join task results for 1 to 100 nodes; Vertica and DBMS-X
finish within tens of seconds (bar labels between 15.7 s and 85.0 s), while Hadoop
needs on the order of a thousand seconds.]

Joins are rather complex to formulate in MapReduce.
Repartitioning incurs high communication overhead.
Joins can be accelerated using indexes.

Jens Teubner Data Warehousing Winter 2014/15 207
MapReduce ↔ Databases

Persistent data ↔ data read ad-hoc:


Overhead for schema design, loading, indexing, etc.
Cost might amortize only after several queries/analyses.
Databases feature support for transactions.
Not needed for read-only workloads.

Language: SQL ↔ Java/C++/. . . :


Write a new MapReduce program for each and every analysis?
User-defined functionality in SQL?
E.g., similarity measures, statistics functions, etc.
Debug SQL or MapReduce job?

Is there a good middle ground?

Jens Teubner Data Warehousing Winter 2014/15 208


Apache Pig

Idea:
Data processing language that sits in-between SQL and
MapReduce.
Declarative (SQL-like; → allows for optimization, easy
re-use and maintenance)
Procedural-style, rich data model (→ programmers feel
comfortable)
Pig programs are compiled into MapReduce (Hadoop) jobs.

Jens Teubner Data Warehousing Winter 2014/15 209


Pig Latin Example
S = LOAD 'sailors.csv' USING PigStorage(',')
      AS (sid:int, name:chararray, rating:int, age:int);
B = LOAD 'boats.csv' USING PigStorage(',')               -- schema on-the-fly
      AS (bid:int, name:chararray, color:chararray);
R = LOAD 'reserves.csv' USING PigStorage(',')
      AS (sid:int, bid:int, day:chararray);

-- SELECT S.sid, R.day
-- FROM Sailors AS S, Reserves AS R
-- WHERE S.sid = R.sid AND R.bid = 101

A = FILTER R BY (bid == 101);                             -- programming style:
B = JOIN S BY sid, A by sid;                              -- sequence of assignments
X = FOREACH B GENERATE S::sid, A::day AS day;             -- → data flow

STORE X into 'result.csv';

Jens Teubner Data Warehousing Winter 2014/15 210


Pig Latin Data Model
Pig Latin features a fairly rich data model:
atoms:
    e.g., 'foo', 42
tuples: sequence of fields of any data type
    e.g., ('foo', 42)
    access by field name or position, tuples can be nested
bag: collection of tuples (possibly with duplicates)
    e.g., { ('foo', 42),
            (17, ('hello', 'world')) }
map: collection of key → value mappings
    e.g., [ 'fan of' → { ('lakers'),
                         ('iPod') },
            'age'    → 20 ]

Jens Teubner Data Warehousing Winter 2014/15 211


Pig Latin Data Model

Pig Latin's data types can be arbitrarily nested⁴


Contrast to 1NF data model in relational databases
Avoid joins, which MapReduce can't do too well.
Allow for sound data model, including grouping, etc.
Easier integration with user-defined functions

4
Keys for map types must be atomic, though (for efficiency reasons).
Jens Teubner Data Warehousing Winter 2014/15 212
Pig Latin Operators: FILTER

kids = FILTER users BY (age < 18);

Comparison operators: ==, eq, !=, neq, AND, . . .


Can use user-defined functions arbitrarily.

→ Implementation in MapReduce?
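
One plausible answer, sketched in Hadoop terms: FILTER needs no shuffle at all; a
map-only job (zero reduce tasks) that forwards just the matching tuples suffices.
Field positions and class names below are assumptions:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// FILTER users BY (age < 18) as a map-only job: no shuffle, no Reducer.
class FilterMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text userTuple, Context ctx)
            throws IOException, InterruptedException {
        String[] fields = userTuple.toString().split(",");
        int age = Integer.parseInt(fields[2].trim());   // assumed field position of age
        if (age < 18)
            ctx.write(NullWritable.get(), userTuple);   // pass the matching tuple through
    }
}

// Driver side: job.setNumReduceTasks(0) disables the reduce phase entirely.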

Jens Teubner Data Warehousing Winter 2014/15 213


Pig Latin Operators: FOREACH

FOREACH Sailors GENERATE


sid AS sailorId,
name AS sailorName,
( rating, age ) AS sailorInfo;

Apply some processing (e.g., item re-structuring) to every item


of a data set (→ projection in Relational Algebra)
No loop dependence! → parallel execution
(XQuery's FLWOR expressions provide a similar form of iteration.)

Jens Teubner Data Warehousing Winter 2014/15 214


Pig Latin Operators: GROUP

sales_by_cust = GROUP sales BY customerName;

returns a bag (relation) with two fields: group key and bag of
tuples with that key value.
First field is named "group"
Second field is named by the variable ("alias" in Pig
terminology) used in the GROUP statement (here: sales)

→ Implementation in MapReduce?
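
A hedged sketch of one possible MapReduce implementation: the Mapper emits each
tuple under its group key, the shuffle collocates equal keys, and the Reducer
re-assembles the bag. The comma-separated layout and the field position are
assumptions:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// GROUP sales BY customerName: map tags each tuple with its group key.
class GroupMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text saleTuple, Context ctx)
            throws IOException, InterruptedException {
        String customerName = saleTuple.toString().split(",")[0];  // assumed field position
        ctx.write(new Text(customerName), saleTuple);
    }
}

// Reduce re-assembles the bag of tuples that share one group key.
class GroupReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text group, Iterable<Text> tuples, Context ctx)
            throws IOException, InterruptedException {
        StringBuilder bag = new StringBuilder("{");
        for (Text t : tuples)
            bag.append(" (").append(t).append(")");
        ctx.write(group, new Text(bag.append(" }").toString()));
    }
}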

Jens Teubner Data Warehousing Winter 2014/15 215


Pig Latin Operators: COGROUP

Group items from multiple data sets:


O = LOAD 'owner.csv' USING PigStorage(',')
AS (owner:chararray, pet:chararray);
{(Alice, turtle) , (Alice, goldfish) , (Alice, cat) , (Bob, dog) , (Bob, cat)}
F = LOAD 'friend.csv' USING PigStorage(',')
AS (person:chararray, friend:chararray);
{(Cindy, Alice) , (Mark, Alice) , (Paul, Bob) , (Paul, Jane)}
X = COGROUP O BY owner, F BY friend;

( Alice, { (Alice, turtle),
           (Alice, goldfish),
           (Alice, cat) },    { (Cindy, Alice),
                                (Mark, Alice) } )
( Bob,   { (Bob, dog),
           (Bob, cat) },      { (Paul, Bob) } )
( Jane,  { },                 { (Paul, Jane) } )

Jens Teubner Data Warehousing Winter 2014/15 216


Pig Latin Operators: JOIN

join_result = JOIN results BY queryString,
              revenue BY queryString;

Equi-joins only.

→ Implementation in MapReduce?
Cross product between fields 1 and 2 of COGROUP result.

temp = COGROUP results BY queryString,
               revenue BY queryString;
join_result = FOREACH temp GENERATE
              FLATTEN (results), FLATTEN (revenue);
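
Spelled out as a Hadoop job, this becomes a repartition ("reduce-side") join:
Mappers tag each tuple with its source relation and emit it under the join key; the
Reducer then forms the cross product of the two groups, i.e., exactly the FLATTENed
COGROUP above. A rough sketch; the field positions, the comma-separated format, and
the file-name based tagging are assumptions:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map side: tag every tuple with its source relation and emit it under the
// join key (queryString, assumed to be field 0 in both inputs).
class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text tuple, Context ctx)
            throws IOException, InterruptedException {
        // Crude way to tell the inputs apart via the input file path;
        // MultipleInputs with two Mapper classes would be the cleaner route.
        String tag = ctx.getInputSplit().toString().contains("results") ? "R" : "V";
        String joinKey = tuple.toString().split(",")[0];
        ctx.write(new Text(joinKey), new Text(tag + "|" + tuple));
    }
}

// Reduce side: all tuples of both relations that share one join key meet here;
// their cross product is the join result for that key.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> tagged, Context ctx)
            throws IOException, InterruptedException {
        List<String> results = new ArrayList<>();
        List<String> revenue = new ArrayList<>();
        for (Text t : tagged) {
            String s = t.toString();
            (s.startsWith("R|") ? results : revenue).add(s.substring(2));
        }
        for (String r : results)
            for (String v : revenue)
                ctx.write(key, new Text(r + "," + v));
    }
}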

Jens Teubner Data Warehousing Winter 2014/15 217


Pig Latin: More Operators

Many additional operators ease common data analysis tasks, e.g.,


LOAD/STORE
(Not surprisingly, Pig works well together with HDFS.)
UNION
CROSS
ORDER
DISTINCT

Jens Teubner Data Warehousing Winter 2014/15 218


Pig Latin: Debugging

Pig Latin was also designed with the development and analysis
workflow in mind.
Interactive use of Pig (grunt).
Can run Pig programs locally (without Hadoop).
Commands to examine expression results.
DUMP: Write (intermediate) result to storage.
DESCRIBE: Print schema of an (intermediate) result.
EXPLAIN: Print execution plan.
ILLUSTRATE: View step-by-step execution of a plan; show
representative examples of (intermediate) result data.

Jens Teubner Data Warehousing Winter 2014/15 219
