
Pig: Web-Scale Processing

Christopher Olston and many others

Yahoo! Research
Motivation
• Projects increasingly revolve around analysis of big data sets
– Extracting structured data, e.g. face detection
– Understanding complex large-scale phenomena
• social systems (e.g. user-generated web content)
• economic systems (e.g. advertising markets)
• computer systems (e.g. web search engines)

• Data analysis is “inner loop” at Yahoo! et al.


• Big data necessitates parallel processing
Examples
• Detect faces
– You have a function detectFaces()
– You want to run it over n images
– n is big

• Study web usage
– You have a web crawl and click log
– Find sessions that end with the “best” page
Existing Work
• Parallel architectures
– cluster computing
– multi-core processors
• Data-parallel software
– parallel DBMS
– Map-Reduce, Dryad
• Data-parallel languages
– SQL
– NESL
Pig Project

• Data-parallel language (“Pig Latin”)
– Relational data manipulation primitives
– Imperative programming style
– Plug in code to customize processing

• Data-parallel software (“Pig”)
– Cross-program optimizations
Pig Latin Language
Example 1
Detect faces in many images.

I = load '/mydata/images' using ImageParser() as (id, image);
F = foreach I generate id, detectFaces(image);
store F into '/mydata/faces';
Example 2
Find sessions that end with the “best” page.

Visits:
user  url                time
Amy   www.cnn.com        8:00
Amy   www.crap.com       8:05
Amy   www.myblog.com     10:00
Amy   www.flickr.com     10:05
Fred  cnn.com/index.htm  12:00
...

Pages:
url             pagerank
www.cnn.com     0.9
www.flickr.com  0.9
www.myblog.com  0.7
www.crap.com    0.2
...
Efficient Evaluation Method
[Dataflow diagram, bottom-up: Visits and Pages are each repartitioned
by url and joined; the join result is repartitioned by user; group-wise
processing (identify sessions; examine pageranks) yields the answer.]
In Pig Latin

      Visits = load '/data/visits' as (user, url, time);
      Visits = foreach Visits generate user, Canonicalize(url), time;

       Pages = load '/data/pages' as (url, pagerank);

          VP = join Visits by url, Pages by url;
  UserVisits = group VP by user;
    Sessions = foreach UserVisits generate FindSessions(*);
HappyEndings = filter Sessions by BestIsLast(*);

       store HappyEndings into '/data/happy_endings';
Pig Latin, in general
• transformations on sets of records
• easy for users
– high-level, extensible data processing primitives

• easy for the system
– exposes opportunities for parallelism and reuse

operators:
• FILTER
• FOREACH … GENERATE
• GROUP

binary operators:
• JOIN
• COGROUP
• UNION
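
A minimal sketch of how these primitives compose; the input paths,
field names, and tables here are hypothetical, not from the deck:

Queries = load '/data/queries' as (user, query);    -- hypothetical input
Clicks  = load '/data/clicks' as (user, url);       -- hypothetical input

Q  = filter Queries by query is not null;           -- FILTER (unary)
G  = group Q by user;                               -- GROUP (unary)
QC = cogroup Queries by user, Clicks by user;       -- COGROUP (binary)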
Related Languages

• SQL: declarative all-in-one blocks


• NESL: lacks join, cogroup
• Map-Reduce: special case of Pig Latin
• Sawzall: rigid map-then-reduce structure
Pig Latin vs. SQL

SQL: declarative (what, not how); bundles many aspects into one statement

Pig Latin: sequence of simple steps
– closer to imperative programming
– semantic order of operations is obvious
– incremental construction
– debug by viewing intermediate results

“I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of
creating a program in Pig [Latin] is much cleaner and simpler to use than the
single block method of SQL. It is easier to keep track of what your variables
are, and where you are in the process of analyzing your data.”
-- Jasmine Novak, Engineer, Yahoo!
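
For instance, a minimal sketch of incremental construction in Pig's
interactive shell (Grunt); the filter step is illustrative, and dump
prints an alias so intermediate results can be inspected:

grunt> Visits = load '/data/visits' as (user, url, time);
grunt> dump Visits;                           -- inspect the raw records
grunt> V = filter Visits by url is not null;  -- add one step at a time
grunt> dump V;                                -- check it before moving on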
Pig Latin vs. Map-Reduce
• Map-reduce welds together 3 primitives:
process records → create groups → process groups

a = FOREACH input GENERATE flatten(Map(*));
b = GROUP a BY $0;
c = FOREACH b GENERATE Reduce(*);

• In Pig, these primitives are:
– explicit
– independent
– fully composable

• Pig adds primitives for:
– filtering tables
– projecting tables
– combining 2 or more tables

→ more natural programming model
→ optimization opportunities
Map-Reduce as Backend
[Diagram: the user writes a program in Pig Latin (or, eventually, SQL
with automatic rewriting to Pig Latin); Pig rewrites and optimizes the
program, then compiles it to Map-Reduce jobs that run on the cluster.]

Pig is open-source!
http://incubator.apache.org/pig
Is Pig a DBMS?
                         DBMS                              Pig
workload                 Bulk and random reads & writes;   Bulk reads & writes only
                         indexes, transactions
data representation      System controls data format;      Pigs eat anything
                         must pre-declare schema
programming style        System of constraints             Sequence of steps
customizable processing  Custom functions second-class     Easy to incorporate
                         to logic expressions              custom functions
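
“Pigs eat anything”: schemas are optional, so data can be loaded raw
and fields referenced by position. A small sketch (the path is
illustrative):

Raw = load '/data/visits';             -- no schema declared
U   = foreach Raw generate $0, $2;     -- refer to fields by position

V = load '/data/visits' as (user, url, time);  -- or declare one when known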
Ways to Run Pig

• Interactive shell
• Script file
• Embed in host language (e.g., Java)
• soon: Graphical editor
Pig Pen
Development Environment
Pig Pen Environment
• Data catalog
– What data is available
– Metadata: schemas, provenance, etc.

• Library of processing elements
– Pig Latin programs
– Individual processing functions

• Boxes-and-arrows dataflow programming

• Automatic example data
Boxes-and-Arrows Example
Find users who tend to visit “good” pages.
Load Visits(user, url, time) and Load Pages(url, pagerank)
→ Transform Visits to (user, Canonicalize(url), time)
→ Join on url = url
→ Group by user
→ Transform to (user, Average(pagerank) as avgPR)
→ Filter avgPR > 0.5
The same dataflow, annotated with example data:

Load Visits(user, url, time):
  (Amy, cnn.com, 8am)
  (Amy, http://www.snails.com, 9am)
  (Fred, www.snails.com/index.html, 11am)

Load Pages(url, pagerank):
  (www.cnn.com, 0.9)
  (www.snails.com, 0.4)

Transform to (user, Canonicalize(url), time):
  (Amy, www.cnn.com, 8am)
  (Amy, www.snails.com, 9am)
  (Fred, www.snails.com, 11am)

Join on url = url:
  (Amy, www.cnn.com, 8am, 0.9)
  (Amy, www.snails.com, 9am, 0.4)
  (Fred, www.snails.com, 11am, 0.4)

Group by user:
  (Amy, { (Amy, www.cnn.com, 8am, 0.9),
          (Amy, www.snails.com, 9am, 0.4) })
  (Fred, { (Fred, www.snails.com, 11am, 0.4) })

Transform to (user, Average(pagerank) as avgPR):
  (Amy, 0.65)
  (Fred, 0.4)

Filter avgPR > 0.5:
  (Amy, 0.65)
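
A rough sketch of the same dataflow written directly in Pig Latin
(Canonicalize and Average are the user-defined functions the slides
assume; field references after the join are simplified):

Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages  = load '/data/pages' as (url, pagerank);

VP    = join Visits by url, Pages by url;
G     = group VP by user;
AvgPR = foreach G generate group as user, Average(VP.pagerank) as avgPR;
Good  = filter AvgPR by avgPR > 0.5;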
Generating Example Data
• Objectives:
– Realism
– Conciseness
– Completeness

• Challenges:
– Large original data
– Selective operators (e.g., join, filter)
– Noninvertible operators (e.g., UDFs)
Talk Outline

• Pig Latin language
• Pig Pen environment
• Pig system
• Cross-job optimizer
Pig System

[Architecture diagram: the user submits a Pig Latin program; the Parser
produces a parsed program; the Pig Compiler, which contains the cross-job
optimizer, produces an execution plan; the MR Compiler turns the plan
into map-reduce jobs, which run on the Map-Reduce cluster and produce
the output.]
Key Issue: Redundant Work
• Popular tables
– web crawl
– search log

• Popular transformations
– eliminate spam pages
– group pages by host
– join web crawl with search log

• Goal: Minimize redundant work


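To make the redundancy concrete, a hypothetical pair of Pig Latin jobs
that repeat the same scan-and-filter prefix (the paths and the IsSpam
function are illustrative, not from the deck):

-- Job 1: eliminate spam pages, then group crawl pages by host
Crawl  = load '/data/webcrawl' as (url, host, content);
Good   = filter Crawl by not IsSpam(url, content);   -- hypothetical spam UDF
ByHost = group Good by host;

-- Job 2, submitted independently, repeats the same load and filter
Crawl2 = load '/data/webcrawl' as (url, host, content);
Good2  = filter Crawl2 by not IsSpam(url, content);
Log    = load '/data/searchlog' as (url, query);
J      = join Good2 by url, Log by url;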
Work-Sharing Techniques

• execute similar jobs together: evaluate operators shared by Job 1 and
  Job 2 (e.g., a common Op1 over table A) only once
• cache data transformations: keep a shared operator's output (e.g., A′
  derived from A) for later jobs
• cache data moves: keep tables (e.g., A and B for a join) resident on
  workers to avoid re-shipping them
Executing Similar Jobs Together

[Diagram: arriving jobs enter a queue of job groups, which feeds the
execution engine.]

Optimal queue ordering policy?

[Diagram: Schedules A and B order the queued groups of Job 1 and Job 2
differently.]

• New “sharable” jobs arrive with frequency λ1, λ2
• Which schedule is best:
– If λ1 >> λ2
– If λ1 = λ2
Caching Data Transformations

[Diagram: Job 1 computes Op1 → Op2 → Op3; Job 2 reuses part of this
pipeline.]

• Options:
– Cache Op2 output
– Cache Op3 output
– Cache both

• Considerations:
– Space
– Utility
– Cost to generate

Difficult to estimate a priori
Can materialize “fragments”, and learn
Caching Data Moves

[Diagram: a Job Localizer assigns incoming jobs (e.g., Scan C; Scan A
and Scan B feeding Join A & B) to workers in the execution engine whose
caches already hold the needed tables (copies of A, B, C, D spread
across Worker 1 and Worker 2), avoiding repeated data moves.]
Related Work
• FS & DB data placement techniques
– model-based, random or round-robin
– some incorporate fault-tolerance considerations

• DB work on selecting materialized views
– model-based
– some combine with query optimization

• Prior data-parallel systems, e.g., Map-Reduce, Dryad, parallel DBMSs
– no work sharing
Summary

• Data-parallel language (“Pig Latin”)
– sequence of data transformation steps
– users can plug in custom code

• Development environment (“Pig Pen”)
– automatically generated example data

• Runtime optimizations
– amortize work across related programs
Credits

Shubham Chopra, Chris Olston, Alan Gates, Ben Reed, Dan Kifer,
Adam Silberstein, Ravi Kumar, Utkarsh Srivastava, Antonio Magnaghi,
Andrew Tomkins, Shravan Narayanamurthy, Erik Vee, Olga Natkovich,
Rob Weltman

Interns: Parag Agarwal, Tyson Condie, Sandeep Pandey, Ying Xu
