MapReduce et al.
Problems:
OLAP queries are multi-dimensional queries.
Curse of Dimensionality: indexes become ineffective.
Indexes can't help to fight growing query complexity.
Workloads become scan-heavy.
Scaling up a server becomes expensive.
[Figure: query performance of Scan, R*-tree, X-tree, and VA-File over the number of dimensions in vectors (0-45); as dimensionality grows, the tree indexes degrade until a plain scan wins.]
[Figure: shared-nothing architecture with many AMPs. A user query is compiled and its units of work are distributed evenly among the AMPs, which execute them in parallel: true query parallelism and balanced performance. Serialized steps (typically sorts or aggregations performed by a single process) become bottlenecks.]
Challenges:
Robustness:
More components → higher risk of failure.
Failure of a single component might take the whole system off-line.
Scalability/Elasticity:
Provision for peak load?
Use the resources otherwise when the DW is not at peak load?
Add resources later (when the business grows)?
Cost:
(Reliable) large installations tend to become expensive.
(There's a relatively small market for very large systems.)
E.g.,
1 foreach document doc do 8 foreach key, (values) do
2 pos 1; 9 count 0;
3 tokens parse (doc); 10 pList ();
4 foreach word in tokens do 11 foreach v values do
5 emit word, doc.id:pos; 12 pList.append (v);
6 pos pos + 1; 13 count count + 1;
14 emit key, count, pList;
7 collect key, (values . . . ) pairs;
[Figure: each mapper emits a ⟨term, entries⟩ list; the shuffle phase groups the entries by term for the reducers.]
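To make the data flow concrete, here is a minimal in-memory simulation of the pseudocode above in Python (the dictionary stands in for the shuffle; documents and names are illustrative):

    from collections import defaultdict

    # Mapper: emit a (word, "docid:position") pair for every token.
    def map_doc(doc_id, text):
        for pos, word in enumerate(text.split(), start=1):
            yield word, f"{doc_id}:{pos}"

    # Reducer: build the posting list and its length for one term.
    def reduce_postings(word, values):
        plist = list(values)
        return word, len(plist), plist

    docs = {1: "to be or not to be", 2: "not to worry"}

    # "Shuffle": group all emitted pairs by key.
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, posting in map_doc(doc_id, text):
            groups[word].append(posting)

    for word in sorted(groups):
        print(reduce_postings(word, groups[word]))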
³ Dean and Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
Example: Webserver Log File Analysis
E.g., webserver log file analysis.
Task: for each client IP, report the total traffic (in bytes).
Mapper and Reducer implementations?
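One possible answer, sketched as Hadoop Streaming-style Python (the field positions assume the common log format, where the client IP is the first field and the response size the last; all names are illustrative):

    import sys

    # mapper.py: one log line in -> one "client_ip <TAB> bytes" line out.
    def mapper():
        for line in sys.stdin:
            fields = line.split()
            if not fields:
                continue
            client_ip, size = fields[0], fields[-1]
            if size.isdigit():               # "-" marks responses without a body
                print(f"{client_ip}\t{size}")

    # reducer.py: input arrives sorted by key; sum the byte counts per client IP.
    def reducer():
        current_ip, total = None, 0
        for line in sys.stdin:
            ip, size = line.rstrip("\n").split("\t")
            if ip != current_ip and current_ip is not None:
                print(f"{current_ip}\t{total}")
                total = 0
            current_ip = ip
            total += int(size)
        if current_ip is not None:
            print(f"{current_ip}\t{total}")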
[Image: MapReduce explained, by @kerzol on Twitter]
MapReduce
Trick:
Mapper and Reducer must be pure functions.
Their output depends only on their input.
No side effects.
Computation can be done anywhere, repeated if necessary.
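For instance, a sketch of the distinction in Python (the counter is an illustrative stand-in for any external state):

    seen = 0  # external state, for illustration only

    # Pure: output depends only on the input pair -> safe to run on any
    # node and to re-execute as often as the runtime likes.
    def mapper(key, value):
        yield key, value.lower()

    # Impure: mutating external state means a restarted or duplicated
    # task would process (and count) the same input twice.
    def bad_mapper(key, value):
        global seen
        seen += 1  # side effect
        yield key, value.lower()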
MapReduce runtime:
Monitor job execution.
Job does not finish within the expected time?
→ Restart it on a different node.
Might end up processing a task unit twice → discard all results but one.
Also used to improve performance (in case of stragglers).
Performance: Grep

Setup (paper from 2004): 1,800 machines, each with 2 × 2 GHz CPUs, a 160 GB IDE HDD, and Gigabit Ethernet, arranged in a two-level tree-shaped switched network with approximately 100-200 Gbps of aggregate bandwidth at the root.

[Figure 2: Data transfer rate over time; the input rate peaks around 30,000 MB/s within the first 100 seconds.]

→ Leverage aggregate disk bandwidth.
This is what we need for OLAP, too.
Performance: Sort
[Figure 3: Data transfer rates over time for different executions of the sort program; rows show input, shuffle, and output rates (MB/s), columns show (a) normal execution, (b) no backup tasks, (c) 200 tasks killed.]
This is unfortunate:
No schema information to optimize, validate, etc.
No indexes (or other means to improve the physical representation).
This is good:
Start analyzing immediately; don't wait for index creation, etc.
May ease ad-hoc analyses.
MapReduce vs. Databases: Load Times

[Figures (Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009): load times in seconds for the Grep task data set (1 TB/cluster) and a log-file data set (20 GB/node) on clusters of 1-100 nodes; loading into Hadoop takes around 70-80 seconds, while the databases take thousands of seconds.]

→ Schema and physical data organization make loading slower on the databases.
MapReduce vs. Databases: Grep Benchmark

[Figure (Pavlo et al.): Grep task execution times in seconds on 25-100 nodes.]

MapReduce leaves its result as a collection of files; collecting them into a single result costs additional time.
MapReduce vs. Databases: Aggregation

[Figures 7 and 8 (Pavlo et al.): aggregation task results in seconds for 2.5 million groups and 2,000 groups on 1-100 nodes; the two DBMSs perform about the same.]
Idea:
A data processing language that sits in-between SQL and MapReduce:
Declarative, SQL-like ⇒ allows for optimization, easy re-use, and maintenance.
Procedural-style, rich data model ⇒ programmers feel comfortable.
Pig programs are compiled into MapReduce (Hadoop) jobs.
⁴ Keys for map types must be atomic, though (for efficiency reasons).
Pig Latin Operators: FILTER
Implementation in MapReduce?
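One plausible answer: FILTER needs no reduce phase at all; a map-only job applies the predicate and emits the matching tuples. A minimal sketch in Python (data and names are illustrative):

    # FILTER as a map-only job: no shuffle or reduce phase is needed.
    def filter_mapper(record, predicate):
        if predicate(record):
            yield record

    # e.g., keep only sales tuples with an amount > 100
    sales = [("apple", 50), ("car", 25000), ("pen", 2)]
    kept = [t for rec in sales
              for t in filter_mapper(rec, lambda r: r[1] > 100)]
    print(kept)  # [('car', 25000)]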
Pig Latin Operators: GROUP

GROUP returns a bag (relation) with two fields: the group key and a bag of the tuples with that key value.
The first field is named group.
The second field is named after the variable (alias in Pig terminology) used in the GROUP statement (here: sales).
Implementation in MapReduce?
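One plausible answer, sketched in Python (the dictionary simulates the framework's shuffle, which performs the actual grouping; data and names are illustrative):

    from collections import defaultdict

    # Mapper: emit the grouping key; the shuffle brings equal keys together.
    def group_mapper(record, key_field):
        yield record[key_field], record

    # Reducer: output matches GROUP's schema -- (group key, bag of tuples).
    def group_reducer(key, records):
        return key, list(records)

    sales = [("apple", 1), ("apple", 3), ("pen", 2)]
    buckets = defaultdict(list)          # simulated shuffle
    for rec in sales:
        for k, v in group_mapper(rec, 0):
            buckets[k].append(v)

    print([group_reducer(k, v) for k, v in buckets.items()])
    # [('apple', [('apple', 1), ('apple', 3)]), ('pen', [('pen', 2)])]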
Pig Latin Operators: JOIN

Equi-joins only.
Implementation in MapReduce?
→ Cross product between fields 1 and 2 of the COGROUP result.
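This suggests a reduce-side equi-join: the mapper tags each tuple with its source relation, and the reducer forms, per key, the cross product of the two bags, exactly as for the COGROUP result. A sketch in Python (data and names are illustrative):

    from collections import defaultdict
    from itertools import product

    # Mapper: tag each tuple so the reducer can separate the two bags.
    def join_mapper(record, relation, key_field):
        yield record[key_field], (relation, record)

    # Reducer: per key, cross product of the left and right bags.
    def join_reducer(key, tagged):
        left = [t for rel, t in tagged if rel == "L"]
        right = [t for rel, t in tagged if rel == "R"]
        for l, r in product(left, right):
            yield l + r

    L = [("k1", "a"), ("k2", "b")]
    R = [("k1", "x"), ("k1", "y")]
    buckets = defaultdict(list)          # simulated shuffle
    for rel, data in (("L", L), ("R", R)):
        for rec in data:
            for k, v in join_mapper(rec, rel, 0):
                buckets[k].append(v)

    for k in sorted(buckets):
        print(list(join_reducer(k, buckets[k])))
    # k1 yields [('k1', 'a', 'k1', 'x'), ('k1', 'a', 'k1', 'y')]; k2 yields []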
Pig Latin was also designed with the development and analysis workflow in mind:
Interactive use of Pig (grunt).
Pig programs can be run locally (without Hadoop).
Commands to examine expression results:
DUMP: Write an (intermediate) result to the screen.
DESCRIBE: Print the schema of an (intermediate) result.
EXPLAIN: Print the execution plan.
ILLUSTRATE: View the step-by-step execution of a plan; shows representative examples of the (intermediate) result data.