
Aggregating Data with Pair RDDs

Chapter 12

201509
Course Chapters

Course Introduction
  1. Introduction
Introduction to Hadoop
  2. Introduction to Hadoop and the Hadoop Ecosystem
  3. Hadoop Architecture and HDFS
Importing and Modeling Structured Data
  4. Importing Relational Data with Apache Sqoop
  5. Introduction to Impala and Hive
  6. Modeling and Managing Data with Impala and Hive
  7. Data Formats
  8. Data File Partitioning
Ingesting Streaming Data
  9. Capturing Data with Apache Flume
Distributed Data Processing with Spark
  10. Spark Basics
  11. Working with RDDs in Spark
  12. Aggregating Data with Pair RDDs
  13. Writing and Deploying Spark Applications
  14. Parallel Processing in Spark
  15. Spark RDD Persistence
  16. Common Patterns in Spark Data Processing
  17. Spark SQL and DataFrames
Course Conclusion
  18. Conclusion
Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Aggregating Data with Pair RDDs

In this chapter you will learn

- How to create Pair RDDs of key-value pairs from generic RDDs
- Special operations available on Pair RDDs
- How map-reduce algorithms are implemented in Spark
Chapter Topics

Aggregating Data with Pair RDDs (Distributed Data Processing with Spark)

- Key-Value Pair RDDs
- Map-Reduce
- Other Pair RDD Operations
- Conclusion
- Homework: Use Pair RDDs to Join Two Datasets
Pair RDDs

Pair RDDs are a special form of RDD

- Each element must be a key-value pair (a two-element tuple),
  e.g., (key1,value1), (key2,value2), (key3,value3)
- Keys and values can be any type

Why?

- Use with map-reduce algorithms
- Many additional functions are available for common data processing
  needs, e.g., sorting, joining, grouping, counting, etc.
Creating Pair RDDs

The first step in most workflows is to get the data into key/value form

- What should the RDD be keyed on?
- What is the value?

Commonly used functions to create Pair RDDs

- map
- flatMap / flatMapValues
- keyBy
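Outside Spark, the difference between keying with map and keying with keyBy can be sketched in plain Python. This is a local illustration of the semantics only (list comprehensions standing in for RDD transformations), not Spark API calls:

```python
# Plain-Python sketch of two ways to key records (mimics Pair RDD semantics locally)
lines = ["user001\tFred Flintstone", "user090\tBugs Bunny"]

# map-style: split each line, then build an explicit (key, value) tuple
pairs = [(f[0], f[1]) for f in (line.split("\t") for line in lines)]

# keyBy-style: compute a key, but keep the WHOLE record as the value
keyed = [(line.split("\t")[0], line) for line in lines]

# pairs[0] == ('user001', 'Fred Flintstone')
# keyed[0] == ('user001', 'user001\tFred Flintstone')
```

Note the difference in the value: map lets you choose it, while keyBy always keeps the original element.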
Example: A Simple Pair RDD

Example: Create a Pair RDD from a tab-separated file

Python
> users = sc.textFile(file) \
      .map(lambda line: line.split('\t')) \
      .map(lambda fields: (fields[0],fields[1]))

Scala
> val users = sc.textFile(file).
      map(line => line.split('\t')).
      map(fields => (fields(0),fields(1)))

Input Data                  Pair RDD
user001\tFred Flintstone    (user001,Fred Flintstone)
user090\tBugs Bunny         (user090,Bugs Bunny)
user111\tHarry Potter       (user111,Harry Potter)
Example: Keying Web Logs by User ID

Python
> sc.textFile(logfile) \
      .keyBy(lambda line: line.split(' ')[2])

Scala
> sc.textFile(logfile).
      keyBy(line => line.split(' ')(2))

Input (keyed by User ID):
56.38.234.188 99788 "GET /KBDOC-00157.html HTTP/1.0"
56.38.234.188 99788 "GET /theme.css HTTP/1.0"
203.146.17.59 25254 "GET /KBDOC-00230.html HTTP/1.0"

Result (keyBy keeps the whole line as the value):
(99788,56.38.234.188 99788 "GET /KBDOC-00157.html HTTP/1.0")
(99788,56.38.234.188 99788 "GET /theme.css HTTP/1.0")
(25254,203.146.17.59 25254 "GET /KBDOC-00230.html HTTP/1.0")
Question 1: Pairs With Complex Values

How would you do this?

- Input: a list of postal codes with latitude and longitude
- Output: postal code (key) and lat/long pair (value)

Input Data                    Pair RDD
00210 43.005895 -71.013202    (00210,(43.005895,-71.013202))
00211 43.005895 -71.013202    (00211,(43.005895,-71.013202))
00212 43.005895 -71.013202    (00212,(43.005895,-71.013202))
00213 43.005895 -71.013202    (00213,(43.005895,-71.013202))
00214 43.005895 -71.013202
Answer 1: Pairs With Complex Values

Python
> sc.textFile(file) \
      .map(lambda line: line.split()) \
      .map(lambda fields: (fields[0],(fields[1],fields[2])))

Scala
> sc.textFile(file).
      map(line => line.split('\t')).
      map(fields => (fields(0),(fields(1),fields(2))))

Input Data                    Pair RDD
00210 43.005895 -71.013202    (00210,(43.005895,-71.013202))
01014 42.170731 -72.604842    (01014,(42.170731,-72.604842))
01062 42.324232 -72.67915     (01062,(42.324232,-72.67915))
01263 42.3929 -73.228483      (01263,(42.3929,-73.228483))
Question 2: Mapping Single Rows to Multiple Pairs (1)

How would you do this?

- Input: order numbers with a list of SKUs in the order
- Output: order (key) and sku (value)

Input Data                           Pair RDD
00001 sku010:sku933:sku022           (00001,sku010)
00002 sku912:sku331                  (00001,sku933)
00003 sku888:sku022:sku010:sku594    (00001,sku022)
00004 sku411                         (00002,sku912)
                                     (00002,sku331)
                                     (00003,sku888)
                                     ...
Question 2: Mapping Single Rows to Multiple Pairs (2)

Hint: map alone won't work, because it produces exactly one output element per input record:

00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411

(00001,(sku010,sku933,sku022))
(00002,(sku912,sku331))
(00003,(sku888,sku022,sku010,sku594))
(00004,(sku411))
Answer 2: Mapping Single Rows to Multiple Pairs (1)

> sc.textFile(file)

00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
Answer 2: Mapping Single Rows to Multiple Pairs (2)

> sc.textFile(file) \
      .map(lambda line: line.split('\t'))

00001 sku010:sku933:sku022           [00001,sku010:sku933:sku022]
00002 sku912:sku331                  [00002,sku912:sku331]
00003 sku888:sku022:sku010:sku594    [00003,sku888:sku022:sku010:sku594]
00004 sku411                         [00004,sku411]

Note that split returns 2-element arrays, not pairs/tuples
Answer 2: Mapping Single Rows to Multiple Pairs (3)

> sc.textFile(file) \
      .map(lambda line: line.split('\t')) \
      .map(lambda fields: (fields[0],fields[1]))

[00001,sku010:sku933:sku022]         (00001,sku010:sku933:sku022)
[00002,sku912:sku331]                (00002,sku912:sku331)
[00003,sku888:sku022:sku010:sku594]  (00003,sku888:sku022:sku010:sku594)
[00004,sku411]                       (00004,sku411)

Map array elements to tuples to produce a Pair RDD
Answer 2: Mapping Single Rows to Multiple Pairs (4)

> sc.textFile(file) \
      .map(lambda line: line.split('\t')) \
      .map(lambda fields: (fields[0],fields[1])) \
      .flatMapValues(lambda skus: skus.split(':'))

(00001,sku010:sku933:sku022)         (00001,sku010)
(00002,sku912:sku331)                (00001,sku933)
(00003,sku888:sku022:sku010:sku594)  (00001,sku022)
(00004,sku411)                       (00002,sku912)
                                     (00002,sku331)
                                     (00003,sku888)
                                     ...
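The pipeline above can be mimicked locally in plain Python (no Spark needed) to see exactly how flatMapValues multiplies each keyed record into several pairs while keeping the key:

```python
# Plain-Python mimic of map + flatMapValues: one (order, sku) pair per SKU
lines = ["00001\tsku010:sku933:sku022", "00002\tsku912:sku331"]

keyed = [tuple(line.split("\t")) for line in lines]   # (order, "sku:sku:...")
pairs = [(order, sku)                                 # flatMapValues-style expansion
         for (order, skus) in keyed
         for sku in skus.split(":")]

# pairs == [('00001','sku010'), ('00001','sku933'), ('00001','sku022'),
#           ('00002','sku912'), ('00002','sku331')]
```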
Chapter Topics

Aggregating Data with Pair RDDs (Distributed Data Processing with Spark)

- Key-Value Pair RDDs
- Map-Reduce
- Other Pair RDD Operations
- Conclusion
- Homework: Use Pair RDDs to Join Two Datasets
Map-Reduce

Map-reduce is a common programming model

- Easily applicable to distributed processing of large data sets

Hadoop MapReduce is the major implementation

- Somewhat limited
  - Each job has one Map phase, one Reduce phase
  - Job output is saved to files

Spark implements map-reduce with much greater flexibility

- Map and reduce functions can be interspersed
- Results can be stored in memory
- Operations can easily be chained
Map-Reduce in Spark

Map-reduce in Spark works on Pair RDDs

- Map phase
  - Operates on one record at a time
  - Maps each record to one or more new records
  - e.g., map, flatMap, filter, keyBy
- Reduce phase
  - Works on map output
  - Consolidates multiple records
  - e.g., reduceByKey, sortByKey, mean
Map-Reduce Example: Word Count

Input Data                      Result
the cat sat on the mat          aardvark 1
the aardvark sat on the sofa    cat 1
                                mat 1
                                on 2
                                sat 2
                                sofa 1
                                the 4
Example: Word Count (1)

> counts = sc.textFile(file)

the cat sat on the mat
the aardvark sat on the sofa
Example: Word Count (2)

> counts = sc.textFile(file) \
      .flatMap(lambda line: line.split())

Input Data                      Words
the cat sat on the mat          the
the aardvark sat on the sofa    cat
                                sat
                                on
                                the
                                mat
                                the
                                aardvark
                                ...
Example: Word Count (3)

> counts = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word,1))

Input Data                      Words       Key-Value Pairs
the cat sat on the mat          the         (the, 1)
the aardvark sat on the sofa    cat         (cat, 1)
                                sat         (sat, 1)
                                on          (on, 1)
                                the         (the, 1)
                                mat         (mat, 1)
                                the         (the, 1)
                                aardvark    (aardvark, 1)
                                ...
Example: Word Count (4)

> counts = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word,1)) \
      .reduceByKey(lambda v1,v2: v1+v2)

Input Data                      Key-Value Pairs    Result
the cat sat on the mat          (the, 1)           (aardvark, 1)
the aardvark sat on the sofa    (cat, 1)           (cat, 1)
                                (sat, 1)           (mat, 1)
                                (on, 1)            (on, 2)
                                (the, 1)           (sat, 2)
                                (mat, 1)           (sofa, 1)
                                (the, 1)           (the, 4)
                                (aardvark, 1)
                                ...
ReduceByKey (1)

The function passed to reduceByKey combines two values for the same key

- The function must be binary (it takes exactly two arguments)

> counts = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word,1)) \
      .reduceByKey(lambda v1,v2: v1+v2)

For example, the four (the,1) pairs might be combined two at a time:

(the,1) + (the,1) -> (the,2)
(the,2) + (the,1) -> (the,3)
(the,3) + (the,1) -> (the,4)

Final result:
(aardvark,1) (cat,1) (mat,1) (on,2) (sat,2) (sofa,1) (the,4)
ReduceByKey (2)

The function might be called in any order, therefore it must be

- Commutative: x+y = y+x
- Associative: (x+y)+z = x+(y+z)

> counts = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word,1)) \
      .reduceByKey(lambda v1,v2: v1+v2)

For example, the four (the,1) pairs might instead be combined as:

(the,1) + (the,1) -> (the,2)
(the,1) + (the,1) -> (the,2)
(the,2) + (the,2) -> (the,4)

The result, (the,4), is the same either way.
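The semantics of reduceByKey (and why the function must be commutative and associative) can be mimicked locally in plain Python. This sketch is not Spark code; it simply groups values by key and folds each group with a binary function, as reduceByKey does across partitions:

```python
from functools import reduce
from collections import defaultdict

# Plain-Python mimic of reduceByKey: group values by key, then fold each
# group with a binary function (here addition, as in word count)
def reduce_by_key(pairs, fn):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: reduce(fn, vs) for k, vs in groups.items()}

words = "the cat sat on the mat the aardvark sat on the sofa".split()
counts = reduce_by_key([(w, 1) for w in words], lambda a, b: a + b)
# counts["the"] == 4, counts["sat"] == 2, counts["aardvark"] == 1
```

Because addition is commutative and associative, the fold gives the same totals no matter how Spark partitions the pairs.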
Word Count Recap (the Scala Version)

> val counts = sc.textFile(file).


flatMap(line => line.split("\\W")).
map(word => (word,1)).
reduceByKey((v1,v2) => v1+v2)

OR

> val counts = sc.textFile(file).


flatMap(_.split("\\W")).
map((_,1)).
reduceByKey(_+_)

Why Do We Care About Counting Words?

Word count is challenging over massive amounts of data

- Using a single compute node would be too time-consuming
- Number of unique words could exceed available memory

Statistics are often simple aggregate functions

- Distributive in nature, e.g., max, min, sum, count

Map-reduce breaks complex tasks down into smaller elements which can be executed in parallel

Many common tasks are very similar to word count, e.g., log file analysis
Chapter Topics

Aggregating Data with Pair RDDs (Distributed Data Processing with Spark)

- Key-Value Pair RDDs
- Map-Reduce
- Other Pair RDD Operations
- Conclusion
- Homework: Use Pair RDDs to Join Two Datasets
Pair RDD Operations

In addition to map and reduce functions, Spark has several operations specific to Pair RDDs

Examples:

- countByKey: return a map with the count of occurrences of each key
- groupByKey: group all the values for each key in an RDD
- sortByKey: sort in ascending or descending order
- join: return an RDD containing all pairs with matching keys from two RDDs
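These operations can be mimicked locally with plain Python data structures, which makes their semantics easy to see without a cluster (local illustration only, not Spark API calls):

```python
from collections import Counter, defaultdict

# Plain-Python mimics of three Pair RDD operations
orders = [("00001", "sku010"), ("00001", "sku933"), ("00002", "sku912")]

count_by_key = Counter(k for k, _ in orders)     # like countByKey

grouped = defaultdict(list)                      # like groupByKey
for k, v in orders:
    grouped[k].append(v)

# like sortByKey(ascending=False): sort pairs by key, descending
sorted_pairs = sorted(orders, key=lambda kv: kv[0], reverse=True)

# count_by_key["00001"] == 2
# grouped["00001"] == ['sku010', 'sku933']
# sorted_pairs[0] == ('00002', 'sku912')
```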
Example: Pair RDD Operations

sortByKey(ascending=False)

(00004,sku411)               (00004,sku411)
(00003,sku888)               (00003,sku888)
(00001,sku010)               (00003,sku022)
(00003,sku022)               (00003,sku010)
(00001,sku933)       ->      (00003,sku594)
(00003,sku010)               (00002,sku912)
(00001,sku022)               (00002,sku331)
(00003,sku594)               (00001,sku010)
(00002,sku912)               (00001,sku933)
(00002,sku331)               (00001,sku022)

groupByKey

(00002,[sku912,sku331])
(00001,[sku022,sku010,sku933])
(00003,[sku888,sku022,sku010,sku594])
(00004,[sku411])
Example: Joining by Key

> movies = moviegross.join(movieyear)

RDD: moviegross           RDD: movieyear
(Casablanca,$3.7M)        (Casablanca,1942)
(Star Wars,$775M)         (Star Wars,1977)
(Annie Hall,$38M)         (Annie Hall,1977)
(Argo,$232M)              (Argo,2012)

Result:
(Casablanca,($3.7M,1942))
(Star Wars,($775M,1977))
(Annie Hall,($38M,1977))
(Argo,($232M,2012))
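An inner join by key like the one above can be mimicked in plain Python. Note one simplification in this sketch: it assumes keys are unique on the right side, whereas Spark's join produces one output pair for every matching combination of values:

```python
# Plain-Python mimic of an inner join by key (as in moviegross.join(movieyear))
gross = [("Casablanca", "$3.7M"), ("Star Wars", "$775M")]
year  = [("Casablanca", 1942), ("Star Wars", 1977), ("Argo", 2012)]

year_by_key = dict(year)   # assumes unique keys on the right side
joined = [(k, (v, year_by_key[k])) for k, v in gross if k in year_by_key]

# joined == [('Casablanca', ('$3.7M', 1942)), ('Star Wars', ('$775M', 1977))]
```

Keys present on only one side ("Argo" here, missing from gross) drop out, exactly as in an inner join.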
Using Join

A common programming pattern:

1. Map separate datasets into key-value Pair RDDs
2. Join by key
3. Map joined data into the desired format
4. Save, display, or continue processing
Example: Join Web Log With Knowledge Base Articles (1)

weblogs (fields include the User ID and the requested file)
56.38.234.188 99788 "GET /KBDOC-00157.html HTTP/1.0"
56.38.234.188 99788 "GET /theme.css HTTP/1.0"
203.146.17.59 25254 "GET /KBDOC-00230.html HTTP/1.0"
221.78.60.155 45402 "GET /titanic_4000_sales.html HTTP/1.0"
65.187.255.81 14242 "GET /KBDOC-00107.html HTTP/1.0"

join with

kblist (Article ID:Article Title)
KBDOC-00157:Ronin Novelty Note 3 - Back up files
KBDOC-00230:Sorrento F33L - Transfer Contacts
KBDOC-00050:Titanic 1000 - Transfer Contacts
KBDOC-00107:MeeToo 5.0 - Transfer Contacts
KBDOC-00300:iFruit 5A overheats
Example: Join Web Log With Knowledge Base Articles (2)

Steps

1. Map separate datasets into key-value Pair RDDs
   a. Map web log requests to (docid,userid)
   b. Map KB Doc index to (docid,title)
2. Join by key: docid
3. Map joined data into the desired format: (userid,title)
4. Further processing: group titles by User ID
Step 1a: Map Web Log Requests to (docid,userid)

> import re
> def getRequestDoc(s):
      return re.search(r'KBDOC-[0-9]*',s).group()

> kbreqs = sc.textFile(logfile) \
      .filter(lambda line: 'KBDOC-' in line) \
      .map(lambda line: (getRequestDoc(line),line.split(' ')[2])) \
      .distinct()

56.38.234.188 99788 "GET /KBDOC-00157.html HTTP/1.0"
56.38.234.188 99788 "GET /theme.css HTTP/1.0"
203.146.17.59 25254 "GET /KBDOC-00230.html HTTP/1.0"
221.78.60.155 45402 "GET /titanic_4000_sales.html HTTP/1.0"
65.187.255.81 14242 "GET /KBDOC-00107.html HTTP/1.0"

kbreqs
(KBDOC-00157,99788)
(KBDOC-00230,25254)
(KBDOC-00107,14242)
Step 1b: Map KB Index to (docid,title)

> kblist = sc.textFile(kblistfile) \
      .map(lambda line: line.split(':')) \
      .map(lambda fields: (fields[0],fields[1]))

KBDOC-00157:Ronin Novelty Note 3 - Back up files
KBDOC-00230:Sorrento F33L - Transfer Contacts
KBDOC-00050:Titanic 1000 - Transfer Contacts
KBDOC-00107:MeeToo 5.0 - Transfer Contacts
KBDOC-00206:iFruit 5A overheats

kblist
(KBDOC-00157,Ronin Novelty Note 3 - Back up files)
(KBDOC-00230,Sorrento F33L - Transfer Contacts)
(KBDOC-00050,Titanic 1000 - Transfer Contacts)
(KBDOC-00107,MeeToo 5.0 - Transfer Contacts)
Step 2: Join by Key (docid)

> titlereqs = kbreqs.join(kblist)

kbreqs                     kblist
(KBDOC-00157,99788)        (KBDOC-00157,Ronin Novelty Note 3 - Back up files)
(KBDOC-00230,25254)        (KBDOC-00230,Sorrento F33L - Transfer Contacts)
(KBDOC-00107,14242)        (KBDOC-00050,Titanic 1000 - Transfer Contacts)
                           (KBDOC-00107,MeeToo 5.0 - Transfer Contacts)

(KBDOC-00157,(99788,Ronin Novelty Note 3 - Back up files))
(KBDOC-00230,(25254,Sorrento F33L - Transfer Contacts))
(KBDOC-00107,(14242,MeeToo 5.0 - Transfer Contacts))
Step 3: Map Result to Desired Format (userid,title)

> titlereqs = kbreqs.join(kblist) \
      .map(lambda (docid,(userid,title)): (userid,title))

(KBDOC-00157,(99788,Ronin Novelty Note 3 - Back up files))
(KBDOC-00230,(25254,Sorrento F33L - Transfer Contacts))
(KBDOC-00107,(14242,MeeToo 5.0 - Transfer Contacts))

(99788,Ronin Novelty Note 3 - Back up files)
(25254,Sorrento F33L - Transfer Contacts)
(14242,MeeToo 5.0 - Transfer Contacts)
Step 4: Continue Processing - Group Titles by User ID

> titlereqs = kbreqs.join(kblist) \
      .map(lambda (docid,(userid,title)): (userid,title)) \
      .groupByKey()

(99788,Ronin Novelty Note 3 - Back up files)
(25254,Sorrento F33L - Transfer Contacts)
(14242,MeeToo 5.0 - Transfer Contacts)

(99788,[Ronin Novelty Note 3 - Back up files,
        Ronin S3 - overheating])
(25254,[Sorrento F33L - Transfer Contacts])
(14242,[MeeToo 5.0 - Transfer Contacts,
        MeeToo 5.1 - Back up files,
        iFruit 1 - Back up files,
        MeeToo 3.1 - Transfer Contacts])

Note: values are grouped into Iterables
Example Output

> for (userid,titles) in titlereqs.take(10):
      print 'user id: ',userid
      for title in titles: print '\t',title

user id: 99788
    Ronin Novelty Note 3 - Back up files
    Ronin S3 - overheating
user id: 25254
    Sorrento F33L - Transfer Contacts
user id: 14242
    MeeToo 5.0 - Transfer Contacts
    MeeToo 5.1 - Back up files
    iFruit 1 - Back up files
    MeeToo 3.1 - Transfer Contacts
Aside: Anonymous Function Parameters

Python and Scala pattern matching can help improve code readability

Python
> map(lambda (docid,(userid,title)): (userid,title))

Scala
> map(pair => (pair._2._1,pair._2._2))

OR

> map{case (docid,(userid,title)) => (userid,title)}

(KBDOC-00157,(99788,title))    (99788,title)
(KBDOC-00230,(25254,title))    (25254,title)
(KBDOC-00107,(14242,title))    (14242,title)
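Note that the tuple-unpacking lambda shown for Python is Python 2 only: Python 3 removed tuple parameters from function signatures (PEP 3113). In Python 3 you index into the pair, or unpack in a comprehension's for-clause, as sketched here:

```python
# Python 3 equivalent of: lambda (docid,(userid,title)): (userid,title)
pairs = [("KBDOC-00157", ("99788", "Ronin Novelty Note 3 - Back up files"))]

# Option 1: index into the nested tuple
indexed = list(map(lambda p: (p[1][0], p[1][1]), pairs))

# Option 2: unpack in the for-clause of a comprehension
result = [(userid, title) for (docid, (userid, title)) in pairs]

# both == [('99788', 'Ronin Novelty Note 3 - Back up files')]
```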
Other Pair Operations

Some other pair operations:

- keys: return an RDD of just the keys, without the values
- values: return an RDD of just the values, without the keys
- lookup(key): return the value(s) for a key
- leftOuterJoin, rightOuterJoin, fullOuterJoin: join, including keys defined in the left, right, or either RDD respectively
- mapValues, flatMapValues: execute a function on just the values, keeping the key the same

See the PairRDDFunctions class Scaladoc for a full list
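Two of these, leftOuterJoin and mapValues, can be mimicked locally in plain Python (a sketch of the semantics only; it assumes unique keys on the right side, whereas Spark emits one pair per matching value combination):

```python
# Plain-Python mimic of leftOuterJoin and mapValues
left  = [("a", 1), ("b", 2)]
right = [("a", "x")]

right_by_key = dict(right)   # assumes unique keys on the right side
left_outer = [(k, (v, right_by_key.get(k))) for k, v in left]
# left_outer == [('a', (1, 'x')), ('b', (2, None))]
# (Python's None stands in for Spark's missing-value marker)

map_values = [(k, v * 10) for k, v in left]   # function applied to values only
# map_values == [('a', 10), ('b', 20)]
```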
Chapter Topics

Aggregating Data with Pair RDDs (Distributed Data Processing with Spark)

- Key-Value Pair RDDs
- Map-Reduce
- Other Pair RDD Operations
- Conclusion
- Homework: Use Pair RDDs to Join Two Datasets
Essential Points

- Pair RDDs are a special form of RDD consisting of key-value pairs (tuples)
- Spark provides several operations for working with Pair RDDs
- Map-reduce is a generic programming model for distributed processing
- Spark implements map-reduce with Pair RDDs
  - Hadoop MapReduce and other implementations are limited to a single map and a single reduce phase per job
  - Spark allows flexible chaining of map and reduce operations
  - Spark provides operations to easily perform common map-reduce algorithms like joining, sorting, and grouping
Chapter Topics

Aggregating Data with Pair RDDs (Distributed Data Processing with Spark)

- Key-Value Pair RDDs
- Map-Reduce
- Other Pair RDD Operations
- Conclusion
- Homework: Use Pair RDDs to Join Two Datasets
Homework: Use Pair RDDs to Join Two Datasets

In this homework assignment you will

- Continue exploring web server log files using key-value Pair RDDs
- Join log data with user account data

Please refer to the Homework description
