Example 1: Image processing
I = load '/mydata/images' using ImageParser() as (id, image);
F = foreach I generate id, detectFaces(image);
store F into '/mydata/faces';
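Conceptually, each statement above is a transformation over a set of records. A minimal Python analogue of the same dataflow, with `detect_faces` as a hypothetical stand-in for the user-defined `detectFaces` function:

```python
# Python analogue of the three Pig statements above. `detect_faces` is a
# hypothetical stand-in for the user-defined detectFaces UDF.

def detect_faces(image):
    # Placeholder UDF: pretend every image yields one face bounding box.
    return [(0, 0, 32, 32)]

def process(records):
    # foreach I generate id, detectFaces(image)
    return [(rec_id, detect_faces(image)) for rec_id, image in records]
```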
Example 2
Find sessions that end with the “best” page.
Visits                                    Pages
user   url                time            url             pagerank
Amy    www.cnn.com        8:00            www.cnn.com     0.9
Amy    www.crap.com       8:05            www.flickr.com  0.9
Amy    www.myblog.com     10:00           www.myblog.com  0.7
Amy    www.flickr.com     10:05           www.crap.com    0.2
Fred   cnn.com/index.htm  12:00           ...
...
Efficient Evaluation Method
[Evaluation dataflow, bottom to top: Visits and Pages are each repartitioned by url and joined; the join result is repartitioned by user for group-wise processing (identify sessions; examine pageranks), producing the answer.]
In Pig Latin
Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
Sessions = foreach UserVisits generate FindSessions(*);
HappyEndings = filter Sessions by BestIsLast(*);
store HappyEndings into '/data/happy_endings';
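`FindSessions` and `BestIsLast` are user-supplied UDFs. A minimal Python sketch of what they might do, under assumed semantics (a session breaks when consecutive visits are more than 60 minutes apart; a "happy ending" means the session's last page has the highest pagerank):

```python
# Sketch of the FindSessions / BestIsLast UDFs under assumed semantics.
# The real UDFs are user code; these are illustrative stand-ins.

def find_sessions(visits, gap=60):
    # visits: list of (url, minutes-since-midnight) for one user.
    visits = sorted(visits, key=lambda v: v[1])
    sessions, current = [], [visits[0]]
    for v in visits[1:]:
        if v[1] - current[-1][1] > gap:   # gap too large: start a new session
            sessions.append(current)
            current = []
        current.append(v)
    sessions.append(current)
    return sessions

def best_is_last(session, pagerank):
    # True when the final page of the session has the highest pagerank.
    ranks = [pagerank.get(url, 0.0) for url, _ in session]
    return ranks[-1] == max(ranks)
```

On the example data, Amy's visits split into two sessions; only the second (ending at www.flickr.com, pagerank 0.9) is a happy ending.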
Pig Latin, in general
• transformations on sets of records
• easy for users
– high-level, extensible data processing primitives
– optimization opportunities
Map-Reduce as Backend
[Architecture diagram: a user writes a Pig program (or, optionally, SQL), which Pig automatically rewrites and optimizes into Map-Reduce jobs that run on the cluster.]

Pig is open-source: http://incubator.apache.org/pig
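To make the compilation concrete, here is a sketch (assumed, simplified mechanics; in-memory stand-ins for the framework phases) of how a Pig `group ... by user` maps onto one Map-Reduce job:

```python
from itertools import groupby
from operator import itemgetter

# Sketch of how `group Visits by user` might compile to a Map-Reduce job.
# These are simplified, in-memory stand-ins for the framework's phases.

def map_phase(records):
    # Map: emit (user, record) pairs keyed for repartitioning.
    return [(user, (user, url, time)) for user, url, time in records]

def shuffle(pairs):
    # Shuffle: sort by key and group, as the framework does between phases.
    pairs = sorted(pairs, key=itemgetter(0))
    return [(k, [v for _, v in g]) for k, g in groupby(pairs, key=itemgetter(0))]

def reduce_phase(grouped):
    # Reduce: hand each (user, bag-of-visits) group to downstream operators.
    return {user: bag for user, bag in grouped}
```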
Is Pig a DBMS?
                    DBMS                               Pig
workload            Bulk and random reads & writes;    Bulk reads & writes only
                    indexes, transactions
programming style   System of constraints              Sequence of steps
• Interactive shell
• Script file
• Embed in host language (e.g., Java)
• soon: Graphical editor
Pig Pen Development Environment
• Data catalog
– What data is available
– Metadata: schemas, provenance, etc.
• Boxes-and-arrows dataflow programming
[Boxes-and-arrows dataflow example:
  Load Visits(user, url, time)          Load Pages(url, pagerank)
  Transform to (user, Canonicalize(url), time)
  Join on url = url
  Group by user
  Transform to (user, Average(pagerank) as avgPR)
  Filter avgPR > 0.5]
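The boxes compose like ordinary functions over sets of tuples. A Python sketch of the whole pipeline (with `canonicalize` as a hypothetical stand-in for the Canonicalize UDF):

```python
# The boxes above composed as functions over lists of tuples (a sketch;
# canonicalize is a hypothetical stand-in for the Canonicalize UDF).

def canonicalize(url):
    return url.replace("http://", "", 1).rstrip("/")

def dataflow(visits, pages):
    v = [(u, canonicalize(url), t) for u, url, t in visits]          # Transform
    pr = dict(pages)
    joined = [(u, url, t, pr[url]) for u, url, t in v if url in pr]  # Join on url
    groups = {}
    for u, url, t, rank in joined:                                   # Group by user
        groups.setdefault(u, []).append(rank)
    avg = [(u, sum(r) / len(r)) for u, r in groups.items()]          # avgPR
    return [(u, a) for u, a in avg if a > 0.5]                       # Filter
```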
[Example data propagated through the tail of the dataflow:
  Group by user
  Transform to (user, Average(pagerank) as avgPR)  →  (Amy, 0.65), (Fred, 0.4)
  Filter avgPR > 0.5  →  (Amy, 0.65)]
Generating Example Data
• Objectives:
– Realism
– Conciseness
– Completeness
• Challenges:
– Large original data
– Selective operators (e.g., join, filter)
– Noninvertible operators (e.g., UDFs)
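Why selective operators are a challenge: if you independently sample both inputs of a join, the sampled keys rarely match, so the join box shows little or no example output. A small illustration on hypothetical data:

```python
import random

# Hypothetical illustration: sample 1% of each join input independently
# and count surviving join tuples. Matches are rare, so naive sampling
# fails the "completeness" objective for example data.

random.seed(0)
visits = [(u, f"url{random.randrange(10_000)}") for u in range(5_000)]
pages = [(f"url{i}", i / 10_000) for i in range(10_000)]

def sample(rel, frac=0.01):
    return [r for r in rel if random.random() < frac]

sv, sp = sample(visits), sample(pages)
page_keys = {url for url, _ in sp}
joined = [(u, url) for u, url in sv if url in page_keys]
# `joined` is typically near-empty even though the full join has 5,000 tuples.
```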
[System diagram: the parser turns a Pig Latin program into a parsed program; the Pig compiler, with a cross-job optimizer, produces an execution plan; the MR compiler translates the plan into Map-Reduce jobs, which run on the Map-Reduce cluster.]
Key Issue: Redundant Work
• Popular tables
– web crawl
– search log
• Popular transformations
– eliminate spam pages
– group pages by host
– join web crawl with search log
[Scheduling diagram: Jobs 1 and 2 each scan overlapping tables; a shared schedule (Schedule A) lets the execution engine perform a single scan of a common table and feed both jobs across its workers.]
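A sketch of the shared-scan idea (assumed mechanics, not Pig's actual scheduler): one pass over a popular table feeds several jobs' pipelines, instead of each job re-scanning the table.

```python
# Shared-scan amortization sketch: a single scan of `records` drives every
# job's per-record operator. This is an illustrative model, not Pig's
# actual execution engine.

def shared_scan(records, jobs):
    # `jobs` maps a job name to a per-record operator; returning None
    # drops the record for that job.
    results = {name: [] for name in jobs}
    for rec in records:                      # single scan of the table
        for name, op in jobs.items():
            out = op(rec)
            if out is not None:
                results[name].append(out)
    return results
```

For example, one job can eliminate spam pages while another groups pages by host, both fed by the same scan of the web crawl.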
Related Work
• FS & DB data placement techniques
– model-based, random or round-robin
– some incorporate fault-tolerance considerations
• Runtime optimizations
– amortize work across related programs
Credits