
Pig: Web-Scale Processing

Christopher Olston and many others

Yahoo! Research
Motivation
• Projects increasingly revolve around analysis of big data sets
– Extracting structured data, e.g. face detection
– Understanding complex large-scale phenomena
• social systems (e.g. user-generated web content)
• economic systems (e.g. advertising markets)
• computer systems (e.g. web search engines)

• Data analysis is “inner loop” at Yahoo! et al.


• Big data necessitates parallel processing
Examples
• Detect faces
– You have a function detectFaces()
– You want to run it over n images
– n is big

• Study web usage
– You have a web crawl and click log
– Find sessions that end with the “best” page
Existing Work
• Parallel architectures
– cluster computing
– multi-core processors
• Data-parallel software
– parallel DBMS
– Map-Reduce, Dryad
• Data-parallel languages
– SQL
– NESL
Pig Project

• Data-parallel language (“Pig Latin”)
– Relational data manipulation primitives
– Imperative programming style
– Plug in code to customize processing

• Data-parallel software (“Pig”)
– Cross-program optimizations
Pig Latin Language
Example 1
Detect faces in many images.

I = load '/mydata/images' using ImageParser() as (id, image);
F = foreach I generate id, detectFaces(image);
store F into '/mydata/faces';
Example 2
Find sessions that end with the “best” page.

Visits:
user  url                time
Amy   www.cnn.com        8:00
Amy   www.crap.com       8:05
Amy   www.myblog.com     10:00
Amy   www.flickr.com     10:05
Fred  cnn.com/index.htm  12:00
...

Pages:
url             pagerank
www.cnn.com     0.9
www.flickr.com  0.9
www.myblog.com  0.7
www.crap.com    0.2
...
Efficient Evaluation Method
[Dataflow diagram, bottom-up: Visits and Pages are each repartitioned
by url and joined; the join result is repartitioned by user; group-wise
processing (identify sessions; examine pageranks) yields the answer.]
In Pig Latin

      Visits = load '/data/visits' as (user, url, time);
      Visits = foreach Visits generate user, Canonicalize(url), time;

       Pages = load '/data/pages' as (url, pagerank);

          VP = join Visits by url, Pages by url;
  UserVisits = group VP by user;
    Sessions = foreach UserVisits generate FindSessions(*);
HappyEndings = filter Sessions by BestIsLast(*);

       store HappyEndings into '/data/happy_endings';
Pig Latin, in general
• transformations on sets of records
• easy for users
– high-level, extensible data processing primitives

• easy for the system
– exposes opportunities for parallelism and reuse

operators:
• FILTER
• FOREACH … GENERATE
• GROUP

binary operators:
• JOIN
• COGROUP
• UNION
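
A minimal sketch of how these primitives compose; the input paths,
field names, and tables here are hypothetical, not from the deck:

Queries = load '/data/queries' as (user, query);    -- hypothetical input
Clicks  = load '/data/clicks' as (user, url);       -- hypothetical input

Q  = filter Queries by query is not null;           -- FILTER (unary)
G  = group Q by user;                               -- GROUP (unary)
QC = cogroup Queries by user, Clicks by user;       -- COGROUP (binary)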
Related Languages

• SQL: declarative all-in-one blocks


• NESL: lacks join, cogroup
• Map-Reduce: special case of Pig Latin
• Sawzall: rigid map-then-reduce structure
Pig Latin vs. SQL

SQL: declarative (what, not how); bundles many aspects into one statement

Pig Latin: sequence of simple steps
– closer to imperative programming
– semantic order of operations is obvious
– incremental construction
– debug by viewing intermediate results

“I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of
creating a program in Pig [Latin] is much cleaner and simpler to use than the
single block method of SQL. It is easier to keep track of what your variables
are, and where you are in the process of analyzing your data.”
-- Jasmine Novak, Engineer, Yahoo!
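
For instance, a minimal sketch of incremental construction in Pig's
interactive shell (Grunt); the filter step is illustrative, and dump
prints an alias so intermediate results can be inspected:

grunt> Visits = load '/data/visits' as (user, url, time);
grunt> dump Visits;                           -- inspect the raw records
grunt> V = filter Visits by url is not null;  -- add one step at a time
grunt> dump V;                                -- check it before moving on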
Pig Latin vs. Map-Reduce
• Map-reduce welds together 3 primitives:
process records → create groups → process groups

a = FOREACH input GENERATE flatten(Map(*));
b = GROUP a BY $0;
c = FOREACH b GENERATE Reduce(*);

• In Pig, these primitives are:
– explicit
– independent
– fully composable

• Pig adds primitives for:
– filtering tables
– projecting tables
– combining 2 or more tables

→ more natural programming model
→ optimization opportunities
Map-Reduce as Backend
[Diagram: the user writes a program in Pig Latin (or, eventually, SQL
with automatic rewriting to Pig Latin); Pig rewrites and optimizes the
program, then compiles it to Map-Reduce jobs that run on the cluster.]

Pig is open-source!
http://incubator.apache.org/pig
Is Pig a DBMS?
                         DBMS                              Pig
workload                 Bulk and random reads & writes;   Bulk reads & writes only
                         indexes, transactions
data representation      System controls data format;      Pigs eat anything
                         must pre-declare schema
programming style        System of constraints             Sequence of steps
customizable processing  Custom functions second-class     Easy to incorporate
                         to logic expressions              custom functions
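
“Pigs eat anything”: schemas are optional, so data can be loaded raw
and fields referenced by position. A small sketch (the path is
illustrative):

Raw = load '/data/visits';             -- no schema declared
U   = foreach Raw generate $0, $2;     -- refer to fields by position

V = load '/data/visits' as (user, url, time);  -- or declare one when known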
Ways to Run Pig

• Interactive shell
• Script file
• Embed in host language (e.g., Java)
• soon: Graphical editor
Pig Pen
Development Environment
Pig Pen Environment
• Data catalog
– What data is available
– Metadata: schemas, provenance, etc.

• Library of processing elements
– Pig Latin programs
– Individual processing functions

• Boxes-and-arrows dataflow programming

• Automatic example data
Boxes-and-Arrows Example
Find users who tend to visit “good” pages.
Load Visits(user, url, time) and Load Pages(url, pagerank)
→ Transform Visits to (user, Canonicalize(url), time)
→ Join on url = url
→ Group by user
→ Transform to (user, Average(pagerank) as avgPR)
→ Filter avgPR > 0.5
The same dataflow, annotated with example data:

Load Visits(user, url, time):
  (Amy, cnn.com, 8am)
  (Amy, http://www.snails.com, 9am)
  (Fred, www.snails.com/index.html, 11am)

Load Pages(url, pagerank):
  (www.cnn.com, 0.9)
  (www.snails.com, 0.4)

Transform to (user, Canonicalize(url), time):
  (Amy, www.cnn.com, 8am)
  (Amy, www.snails.com, 9am)
  (Fred, www.snails.com, 11am)

Join on url = url:
  (Amy, www.cnn.com, 8am, 0.9)
  (Amy, www.snails.com, 9am, 0.4)
  (Fred, www.snails.com, 11am, 0.4)

Group by user:
  (Amy, { (Amy, www.cnn.com, 8am, 0.9),
          (Amy, www.snails.com, 9am, 0.4) })
  (Fred, { (Fred, www.snails.com, 11am, 0.4) })

Transform to (user, Average(pagerank) as avgPR):
  (Amy, 0.65)
  (Fred, 0.4)

Filter avgPR > 0.5:
  (Amy, 0.65)
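
A rough sketch of the same dataflow written directly in Pig Latin
(Canonicalize and Average are the user-defined functions the slides
assume; field references after the join are simplified):

Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages  = load '/data/pages' as (url, pagerank);

VP    = join Visits by url, Pages by url;
G     = group VP by user;
AvgPR = foreach G generate group as user, Average(VP.pagerank) as avgPR;
Good  = filter AvgPR by avgPR > 0.5;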
Generating Example Data
• Objectives:
– Realism
– Conciseness
– Completeness

• Challenges:
– Large original data
– Selective operators (e.g., join, filter)
– Noninvertible operators (e.g., UDFs)
Talk Outline

• Pig Latin language
• Pig Pen environment
• Pig system
• Cross-job optimizer
Pig System

[Architecture diagram: the user submits a Pig Latin program; the Parser
produces a parsed program; the Pig Compiler, which contains the cross-job
optimizer, produces an execution plan; the MR Compiler turns the plan
into map-reduce jobs, which run on the Map-Reduce cluster and produce
the output.]
Key Issue: Redundant Work
• Popular tables
– web crawl
– search log

• Popular transformations
– eliminate spam pages
– group pages by host
– join web crawl with search log

• Goal: Minimize redundant work


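To make the redundancy concrete, a hypothetical pair of Pig Latin jobs
that repeat the same scan-and-filter prefix (the paths and the IsSpam
function are illustrative, not from the deck):

-- Job 1: eliminate spam pages, then group crawl pages by host
Crawl  = load '/data/webcrawl' as (url, host, content);
Good   = filter Crawl by not IsSpam(url, content);   -- hypothetical spam UDF
ByHost = group Good by host;

-- Job 2, submitted independently, repeats the same load and filter
Crawl2 = load '/data/webcrawl' as (url, host, content);
Good2  = filter Crawl2 by not IsSpam(url, content);
Log    = load '/data/searchlog' as (url, query);
J      = join Good2 by url, Log by url;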
Work-Sharing Techniques

• execute similar jobs together: evaluate operators shared by Job 1 and
  Job 2 (e.g., a common Op1 over table A) only once
• cache data transformations: keep a shared operator's output (e.g., A′
  derived from A) for later jobs
• cache data moves: keep tables (e.g., A and B for a join) resident on
  workers to avoid re-shipping them
Executing Similar Jobs Together

[Diagram: arriving jobs enter a queue of job groups, which feeds the
execution engine.]

Optimal queue ordering policy?

[Diagram: Schedules A and B order the queued groups of Job 1 and Job 2
differently.]

• New “sharable” jobs arrive with frequency λ1, λ2
• Which schedule is best:
– If λ1 >> λ2
– If λ1 = λ2
Caching Data Transformations

[Diagram: Job 1 computes Op1 → Op2 → Op3; Job 2 reuses part of this
pipeline.]

• Options:
– Cache Op2 output
– Cache Op3 output
– Cache both

• Considerations:
– Space
– Utility
– Cost to generate

Difficult to estimate a priori
Can materialize “fragments”, and learn
Caching Data Moves

[Diagram: a Job Localizer assigns incoming jobs (e.g., Scan C; Scan A
and Scan B feeding Join A & B) to workers in the execution engine whose
caches already hold the needed tables (copies of A, B, C, D spread
across Worker 1 and Worker 2), avoiding repeated data moves.]
Related Work
• FS & DB data placement techniques
– model-based, random or round-robin
– some incorporate fault-tolerance considerations

• DB work on selecting materialized views
– model-based
– some combine with query optimization

• Prior data-parallel systems, e.g., Map-Reduce, Dryad, parallel DBMSs
– no work sharing
Summary

• Data-parallel language (“Pig Latin”)
– sequence of data transformation steps
– users can plug in custom code

• Development environment (“Pig Pen”)
– automatically generated example data

• Runtime optimizations
– amortize work across related programs
Credits

Shubham Chopra, Chris Olston, Alan Gates, Ben Reed, Dan Kifer,
Adam Silberstein, Ravi Kumar, Utkarsh Srivastava, Antonio Magnaghi,
Andrew Tomkins, Shravan Narayanamurthy, Erik Vee, Olga Natkovich,
Rob Weltman

Interns: Parag Agarwal, Tyson Condie, Sandeep Pandey, Ying Xu
