How to Build a Recommendation Engine Using Apache Mahout

Viraj Paripatyadar
GS Lab

Contents
A recommendation problem
What is a recommender
Building a recommender using Mahout
Tips and tweaks
Recommender considerations

A book store
Sells books:
    By various authors
    Of various categories
    On different subjects
    From various publishers
Readers/buyers are asked to rate
Readers/buyers can provide reviews


You walk into the store (to buy something for a friend)

The store owner
Asks you what:
    your friend reads (already owns)
    your friend usually likes more
Has data on what:
    his customers buy
    his customers rate and review
Uses a few strategies

1 - Find similar books
Depending on which books your friend has, pick books:
    by the same author
    on the same or similar subjects
    in the same category
    from the same publisher
(those with the highest sales numbers)

2 - Find books with similar readership
Define some similarity
    e.g. two books are as similar as the number of readers who rated both of them
Define some limit of relevance
    e.g. only consider books that share more than 4 readers
Look for all books which are similar to books your friend owns
Pick books from this set that your friend doesn't own
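
Strategy 2 can be sketched in a few lines of Python. This is only an illustration of the idea, not Mahout code: it uses the small example data set from this deck (user, book, rating), and the names `readers_by_book`, `shared_readers` and `similar_readership_books` are made up for the sketch. The limit of relevance is a parameter, because with only 5 readers no pair of books shares more than 4 of them.

```python
from collections import defaultdict

# (user, book, rating) triples from the small example data set.
RATINGS = [
    (1, 101, 5.0), (1, 102, 3.0), (1, 103, 2.5),
    (2, 101, 2.0), (2, 102, 2.5), (2, 103, 5.0), (2, 104, 2.0),
    (3, 101, 2.5), (3, 104, 4.0), (3, 105, 4.5), (3, 107, 5.0),
    (4, 101, 5.0), (4, 103, 3.0), (4, 104, 4.5), (4, 106, 4.0),
    (5, 101, 4.0), (5, 102, 3.0), (5, 103, 2.0), (5, 104, 4.0),
    (5, 105, 3.5), (5, 106, 4.0),
]


def readers_by_book(ratings):
    """Map each book to the set of users who rated it."""
    readers = defaultdict(set)
    for user, book, _rating in ratings:
        readers[book].add(user)
    return dict(readers)


def shared_readers(readers, book_a, book_b):
    """Similarity of two books = number of readers who rated both."""
    return len(readers.get(book_a, set()) & readers.get(book_b, set()))


def similar_readership_books(ratings, owned, min_shared):
    """Books not owned that share more than `min_shared` readers
    with at least one owned book, most similar first."""
    readers = readers_by_book(ratings)
    scores = {}
    for candidate in readers:
        if candidate in owned:
            continue
        best = max(
            (shared_readers(readers, candidate, o) for o in owned),
            default=0,
        )
        if best > min_shared:
            scores[candidate] = best
    return sorted(scores, key=lambda b: (-scores[b], b))


friend_owns = {101, 102, 103}
print(similar_readership_books(RATINGS, friend_owns, min_shared=1))
# [104, 105, 106]
```

Raising `min_shared` to 4 returns an empty list on this data, which shows how strongly the limit of relevance depends on the amount of data available.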

3 - Find people with similar tastes
Define some similarity
    e.g. two people are as similar as the number of books they like from the same category
Define some limit of relevance
    e.g. only consider the top 3 people, ranked by how similar they are to your friend
Look for users similar to your friend and see what they read
Pick books which these people like and your friend doesn't own
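
A rough Python sketch of strategy 3 follows. It makes two simplifying assumptions that are not in the slides: the example data carries no categories, so two people are treated as similar by the number of books they both like, and "likes" is taken to mean a rating of 3.0 or more. All function names are hypothetical.

```python
from collections import Counter, defaultdict

# (user, book, rating) triples from the small example data set.
RATINGS = [
    (1, 101, 5.0), (1, 102, 3.0), (1, 103, 2.5),
    (2, 101, 2.0), (2, 102, 2.5), (2, 103, 5.0), (2, 104, 2.0),
    (3, 101, 2.5), (3, 104, 4.0), (3, 105, 4.5), (3, 107, 5.0),
    (4, 101, 5.0), (4, 103, 3.0), (4, 104, 4.5), (4, 106, 4.0),
    (5, 101, 4.0), (5, 102, 3.0), (5, 103, 2.0), (5, 104, 4.0),
    (5, 105, 3.5), (5, 106, 4.0),
]

LIKE_THRESHOLD = 3.0  # assumption: a rating of 3.0 or more means "likes"


def liked_books(ratings, threshold=LIKE_THRESHOLD):
    """Map each user to the set of books they rated at or above the threshold."""
    liked = defaultdict(set)
    for user, book, rating in ratings:
        if rating >= threshold:
            liked[user].add(book)
    return dict(liked)


def recommend_from_similar_people(ratings, friend, top_people):
    """Books liked by the most similar people that the friend does not own."""
    liked = liked_books(ratings)
    friend_likes = liked.get(friend, set())
    owned = {book for user, book, _ in ratings if user == friend}

    # Similarity = number of books both people like (simplified, no categories).
    similarity = {
        user: len(friend_likes & books)
        for user, books in liked.items()
        if user != friend
    }
    # Limit of relevance: keep only the top N people with any overlap at all.
    ranked = sorted(
        (u for u in similarity if similarity[u] > 0),
        key=lambda u: (-similarity[u], u),
    )
    neighbours = ranked[:top_people]

    # Count how many of the similar people like each unowned book.
    votes = Counter()
    for user in neighbours:
        for book in liked[user] - owned:
            votes[book] += 1
    return sorted(votes, key=lambda b: (-votes[b], b))


print(recommend_from_similar_people(RATINGS, friend=1, top_people=3))
# [104, 106, 105]
```

On this data only users 5 and 4 share any liked book with the friend, so they form the whole neighbourhood even though up to 3 people were allowed.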

Example data
Format: user ID, book ID, rating (one line per user below)
1,101,5.0 1,102,3.0 1,103,2.5
2,101,2.0 2,102,2.5 2,103,5.0 2,104,2.0
3,101,2.5 3,104,4.0 3,105,4.5 3,107,5.0
4,101,5.0 4,103,3.0 4,104,4.5 4,106,4.0
5,101,4.0 5,102,3.0 5,103,2.0 5,104,4.0 5,105,3.5 5,106,4.0

Your friend owns three books:
    Gave 5 stars to book 101 (likes it hugely and talks about it all the time)
    Gave 3 stars to book 102 (has shown some liking for it)
    Gave 2.5 stars to book 103 (has read it, but didn't say bad things about it)
Now we need to recommend books your friend hasn't seen

A pictorial representation
[Figure: graph linking users to the books they rated (books 101 to 107)]

Visualize
[Figure: the same graph of users and books 101 to 107]

A (slightly) bigger example
1,101,5.0 1,102,3.0 1,103,2.5 1,109,3.5 1,112,4.0
2,101,2.0 2,102,2.5 2,103,5.0 2,104,2.0 2,107,4.5 2,113,3.5
3,101,2.5 3,104,4.0 3,105,4.5 3,107,5.0 3,111,2.5 3,115,4.0
4,101,5.0 4,103,3.0 4,104,4.5 4,106,4.0 4,109,2.0 4,111,2.5
5,101,4.0 5,102,3.0 5,103,2.0 5,104,4.0 5,105,3.5 5,106,4.0 5,109,3.0 5,112,4.0
6,101,4.5 6,103,2.0 6,106,4.0 6,113,3.0 6,115,5.0
7,103,4.5 7,104,2.5 7,108,4.0 7,109,3.5 7,110,3.5 7,112,2.5
8,101,2.0 8,105,4.0 8,106,4.5 8,110,3.0 8,114,5.0 8,115,3.5

A pictorial representation
[Figure: graph linking users to the books they rated (books 101 to 115)]

Clearly, not a viable option

Mahout to the rescue

What is Apache Mahout
A machine learning library
Works with Apache Hadoop
Use cases:
    Recommenders
    Clustering
    Classification

Recommenders in Mahout
Recommenders use data culled from user behavior
Recommending using Mahout:
    Similarity between users or items
        Expressed as a number between 0 and 1
    Neighborhood of users/items
    Recommendation using this info and an algorithm
        Generic
        Specialized

Similarity
Various algorithms:
    Euclidean distance
    Pearson correlation
    Cosine measure
    Spearman correlation
    Tanimoto coefficient
    Log-likelihood
Effectiveness depends on the input data
The choice influences running time and memory use
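
To make three of these measures concrete, here is a small Python sketch computed on the example data. These are textbook formulations written for illustration, not Mahout's implementations: the function names are hypothetical, and turning a Euclidean distance d into a similarity with 1 / (1 + d) is just one common convention.

```python
import math

# user -> {book: rating}, from the small example data set.
RATINGS = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 2.5, 104: 4.0, 105: 4.5, 107: 5.0},
    4: {101: 5.0, 103: 3.0, 104: 4.5, 106: 4.0},
    5: {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0},
}


def pearson(a, b):
    """Pearson correlation over co-rated books, or None if undefined."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return None
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def euclidean_similarity(a, b):
    """1 / (1 + distance) over co-rated books, or None if there are none."""
    common = set(a) & set(b)
    if not common:
        return None
    distance = math.sqrt(sum((a[i] - b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + distance)


def tanimoto(a, b):
    """Shared books divided by all books either user rated (ratings ignored)."""
    union = set(a) | set(b)
    if not union:
        return None
    return len(set(a) & set(b)) / len(union)


print(round(pearson(RATINGS[1], RATINGS[5]), 4))               # 0.9449
print(round(pearson(RATINGS[1], RATINGS[2]), 4))               # -0.7643
print(pearson(RATINGS[1], RATINGS[3]))                         # None
print(round(euclidean_similarity(RATINGS[1], RATINGS[5]), 4))  # 0.4721
print(tanimoto(RATINGS[1], RATINGS[5]))                        # 0.5
```

The same pair of users scores very differently under each measure, and Pearson correlation can be negative or undefined when too few books are co-rated, which is why effectiveness depends on the input data.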

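Neighborhood
Nearest N neighborhood (say, 4):
[Figure: user U and the nearest users that form its neighborhood]
Threshold neighborhood (say, > 0.8):
[Figure: user U and the users above the similarity threshold that form its neighborhood]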
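
The two neighborhood styles differ only in how they cut the list of similar users. The sketch below shows this in Python with made-up similarity scores for six hypothetical users A to F; the function names are illustrative and are not Mahout class names.

```python
def nearest_n(similarities, n):
    """The n users with the highest similarity, most similar first."""
    ranked = sorted(similarities, key=lambda u: (-similarities[u], u))
    return ranked[:n]


def above_threshold(similarities, threshold):
    """Every user whose similarity is strictly above the threshold."""
    return sorted(u for u, s in similarities.items() if s > threshold)


# Hypothetical similarity of each user to the target user U.
SIMILARITY_TO_U = {
    "A": 0.95, "B": 0.85, "C": 0.70, "D": 0.60, "E": 0.30, "F": 0.10,
}

print(nearest_n(SIMILARITY_TO_U, 4))          # ['A', 'B', 'C', 'D']
print(above_threshold(SIMILARITY_TO_U, 0.8))  # ['A', 'B']
```

A nearest-N neighborhood always has the same size but may admit weak matches, while a threshold neighborhood guarantees quality but can be small or empty.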

Recommender
Recommenders:
    Generic recommender
        User based
        Item based
    Slope-one recommender
    Singular Value Decomposition based
    Linear interpolation based
    Cluster-based
Recommender rescorer
Recommender evaluator
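
As an illustration of how a generic user-based recommender combines similarity and neighborhood, here is a self-contained Python sketch. It is not Mahout code: it assumes Pearson correlation as the similarity, a nearest-N neighborhood, and a similarity-weighted average as the rating estimate, and the function names are hypothetical.

```python
import math

# user -> {book: rating}, from the small example data set.
RATINGS = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 2.5, 104: 4.0, 105: 4.5, 107: 5.0},
    4: {101: 5.0, 103: 3.0, 104: 4.5, 106: 4.0},
    5: {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0},
}


def pearson(a, b):
    """Pearson correlation over co-rated books, or None if undefined."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return None
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def recommend(ratings, user, n_neighbours, how_many):
    """Return up to `how_many` (book, estimated rating) pairs for `user`."""
    mine = ratings[user]

    # 1. Similarity: score every other user; keep only positive correlations.
    sims = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        s = pearson(mine, theirs)
        if s is not None and s > 0:
            sims[other] = s

    # 2. Neighborhood: the N most similar users.
    neighbours = sorted(sims, key=lambda u: (-sims[u], u))[:n_neighbours]

    # 3. Estimate: similarity-weighted average of the neighbours' ratings
    #    for every book the user has not rated yet.
    totals, weights = {}, {}
    for other in neighbours:
        for book, rating in ratings[other].items():
            if book in mine:
                continue
            totals[book] = totals.get(book, 0.0) + sims[other] * rating
            weights[book] = weights.get(book, 0.0) + sims[other]
    estimates = {book: totals[book] / weights[book] for book in totals}

    ranked = sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:how_many]


for book, estimate in recommend(RATINGS, user=1, n_neighbours=2, how_many=3):
    print(book, round(estimate, 2))
```

For user 1 the two nearest neighbours are users 4 and 5, and book 104 comes out on top with an estimated rating of about 4.26.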

A real-life Web application
News aggregator-cum-reader
    Fetches news from a news service
    Shows the news in a uniform UI
    Lets readers read, like/dislike and comment on news
    Link social networks and share
    Track user actions
    Derive and store preferences
    Generate recommendations
    Leverage social accounts, etc.
Make this a personalized newspaper
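
One possible way to derive preferences from tracked user actions is sketched below in Python. Everything specific here is an assumption made for illustration and is not taken from the application: the action names, their weights, the baseline and the 1 to 5 scale are all hypothetical. The output is a list of (user, article, preference) triples, the same shape as the example data used earlier.

```python
from collections import defaultdict

# Hypothetical weights: how much each tracked action moves the preference.
ACTION_WEIGHTS = {
    "read": 1.0,
    "comment": 1.5,
    "like": 2.0,
    "share": 2.0,
    "dislike": -2.0,
}
BASELINE = 2.5                 # hypothetical starting preference
MIN_PREF, MAX_PREF = 1.0, 5.0  # hypothetical preference scale


def derive_preferences(events):
    """Turn (user, article, action) events into (user, article, preference)."""
    scores = defaultdict(float)
    for user, article, action in events:
        # Unknown actions contribute nothing.
        scores[(user, article)] += ACTION_WEIGHTS.get(action, 0.0)
    preferences = []
    for (user, article), delta in sorted(scores.items()):
        value = min(MAX_PREF, max(MIN_PREF, BASELINE + delta))
        preferences.append((user, article, value))
    return preferences


events = [
    (1, "n1", "read"), (1, "n1", "like"),
    (1, "n2", "read"), (1, "n2", "dislike"),
    (2, "n1", "read"),
]
print(derive_preferences(events))
# [(1, 'n1', 5.0), (1, 'n2', 1.5), (2, 'n1', 3.5)]
```

Because several actions on one article collapse into a single preference, tuning the weights changes what the recommender sees without touching the recommender itself.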

Overall design
[Figure: architecture diagram. Third-party applications, phone/tablet applications and the Web application connect over REST to a Controller API (REST). Back-end components: user and application data (MySQL); news aggregation and storage (HBase); preferences and recommender (Mahout).]

Recommender
[Figure: recommender component diagram. REST service (Grizzly, Tomcat): fetch recommendations, input user actions. Recommender (offline, run periodically). Database (MySQL) with input table dump.]

How to extract data: one dimension
[Figure: news article readership on a logarithmic scale (1 to 10000) against number of news articles (1 to 7); extracted data labels include 4299, 511, 128, 51 and 13]

How to extract data: add dimensions
[Figure: news article readership and topic readership on a logarithmic scale (1 to 10000) against number of news articles / topics (1 to 57)]

How more data helps
[Figure: number of readers with x articles each and number of readers with x topics each, against number of news articles/topics; shown in three charts covering the ranges 0 to 800, 5 to 85 and 95 to 395]

Learnings
Know thy user
    Frequency of visits
    Preference logic wrt user
Know thy items
    Should have enough items per user
    Maximize items per action
    Should have enough intersections
    Should not be transient
Use tweaking abilities
Sharpen the saw

Questions

Thank you
viraj@gslab.com viraj.paripatyadar@gmail.com
