
Scalable Recommendation System for Yelp

Chhavi Choudhury, Piyush Bhargava, Sakshi Bhargava

Abstract: Accurately predicting user preferences is a difficult task, yet reasonable predictions are necessary across a myriad of industries to support personalization, which is highly valuable both economically and socially. Inspired by this, we chose to build our own recommendation system at scale, leveraging data from the RecSys2013: Yelp Business Rating Prediction contest. Given a location and a business category, we aimed to recommend to a user a business that they have not reviewed before. The recommendations are based on predictions made using Yelp review ratings and business and user features. This is essentially a cold start problem, as most of the users and businesses in the test data are absent from the training data. After creating a baseline model using average ratings, we applied various modeling techniques such as Random Forest, Logistic Regression, Linear Regression, and Gradient Boosting on features extracted from metadata to predict business ratings. We evaluated our models and tested combinations of features across models using 3-fold cross validation, minimizing Mean Squared Error (MSE). The best RMSE on test data was 1.267 with the Random Forest model, an almost 6% improvement over the mean baseline of 1.39. We also created an elementary recommendation tool for users. Building our recommendation system in Spark ensures scalability for future applications. In the future, we would like to augment our data (using the Yelp API) to decrease sparsity and improve prediction performance.

I. INTRODUCTION
With the explosion of social network websites and online user-generated content platforms, there is an ever-increasing demand for personalization. This also presents a strategic opportunity for businesses to expand and enhance their offerings. Companies such as Amazon, Netflix, and Pandora prominently feature recommended, personalized content for their customers, enhancing their product experience in a significant way. We are motivated by this widespread application of recommendation systems across industries, and we wanted to leverage this opportunity to gain hands-on experience in a rapidly evolving field. We chose data from a popular website, Yelp, as it is extensive and presents a challenging recommendation problem.
We aimed to create an elementary recommendation system for Yelp. We followed the Kaggle competition RecSys2013: Yelp Business Rating Prediction and learned from the best algorithms in the competition. Using collective knowledge about users, businesses, and user ratings (indicators of user preference), we created a model to predict the rating that a Yelp user would give to a business. We validated our model on the test data provided in the competition. The most accurate user preferences were then used to recommend similar businesses to a user for a chosen category and zip code. We implemented our algorithm in Apache Spark, thus making the recommendation system scalable.
II. DATA
The recommendation system was trained on the Yelp dataset from the RecSys Challenge 2013: Yelp Business Rating Prediction competition hosted by Kaggle. The data available on Kaggle is a detailed dump of Yelp reviews, businesses, users, and check-ins for the Phoenix, AZ metropolitan area. The data is in JSON format. The following information is available in the training set:
Business - The business data provides metadata about various businesses, mainly including location, business category, number of reviews for the business, and its average star rating.
Review - The review data provides the star rating given by the user, the review text, and the votes given by various users for a particular review.
User - The user data provides average star ratings, number of reviews, and number of votes for reviews given by users on the Yelp platform.
Check-in - The check-in data provides check-in counts of users at various hours for the businesses they reviewed.
The test data has a similar format; however, proxies for user preference such as business average rating, user average rating, user votes, and review text are missing, which makes the prediction task very challenging and renders this essentially a cold start problem.
In any recommendation problem, we can see four cases of known/unknown user and business information. The distribution of these cases is highlighted in Table I.

TABLE I
KNOWN/UNKNOWN USER AND BUSINESS SEGMENTS IN TEST DATA

                     Known User    Unknown User
Known Business         33.2%          14.1%
Unknown Business       11.2%          14.5%

III. EXPLORATORY DATA ANALYSIS


The key findings from our Exploratory Data Analysis pertain to the sparsity of the data. We have about 230K reviews, 45K users, and 11K businesses in the training data. The distributions of the number of reviews per user and reviews per business are extremely right-skewed, with medians of only 7 and 5 reviews, respectively. This indicates that the data is extremely sparse, and that finding relationships between users and businesses using the reviews in the data would be difficult. Hence, separate analyses were performed to understand the three crucial pieces of this problem.
A. Analysis on Reviews
Our target variable, review star rating, has a left-skewed distribution: most user ratings are either 4 or 5. From this we may conclude that users mostly rate a business when they like it. The global average rating from reviews in the training data was 3.76.

This is a potential threat to the success of a Collaborative Filtering approach, since it requires a decent user history for similarity calculations. Also, a large number of users have only one business in common, which makes their similarity 1 regardless of what their ratings are.
The number of votes for each user in the useful, funny, and cool categories is inclined towards the useful category, which comprises nearly 50% of the total votes. In addition, there is strong multicollinearity between the vote categories; all three pairwise correlations are above 0.95. Thus, we were better off eliminating or combining these variables rather than risking multicollinearity affecting the models that are sensitive to it.

Fig. 1. Distribution of Review Stars

IV. BASELINE RESULTS

Our baseline model takes into account the mean ratings of the users and the mean ratings of the businesses to predict ratings. We notate these as follows: μ is the average over all ratings in the training set (μ = 3.7667); u is the vector containing the average rating for each user with μ subtracted; b is the vector containing the average rating for each business with μ subtracted; R̂ is the predicted ratings matrix. We then use the following equation to predict user i's rating for business j [1]:

R̂_ij = μ + u_i + b_j
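The baseline prediction can be sketched in a few lines of Python (a minimal sketch with made-up ratings; `user_avg` and `business_avg` stand in for the per-user and per-business averages from the training set):

```python
mu = 3.7667  # global average rating in the training data

user_avg = {"u1": 4.2, "u2": 3.1}        # per-user average ratings (made up)
business_avg = {"b1": 3.9, "b2": 2.8}    # per-business average ratings (made up)

def predict(user_id, business_id):
    """R_hat_ij = mu + u_i + b_j, where u_i and b_j are deviations from mu.
    Unknown users/businesses contribute a zero offset (cold start)."""
    u_i = user_avg.get(user_id, mu) - mu
    b_j = business_avg.get(business_id, mu) - mu
    return mu + u_i + b_j
```

Note how a fully unknown (user, business) pair falls back to the global mean, which is exactly why this baseline struggles on the cold start segments.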

B. Analysis on Businesses
There are 14,334 businesses in the training and test datasets combined. However, 2,797 businesses in the test data have missing star rating information; these correspond to almost 25% of the reviews in the test data. This poses a severe problem as we predict ratings for recommendations.
Most businesses have a rating of 3, 3.5, or 4. Low (1, 2) and high (5) ratings are very rare in the dataset. These lower ratings are said to be sandbag ratings, which account for a good portion of the error in our predictions. (Fig. 1)
Business categories pose a major difficulty, as there are 549 distinct categories across the train and test data and each business can be mapped to multiple categories. The number of categories associated with a business ranges from 0 to 10, and almost 65% of businesses have 3 or 4 categories associated with them. The dummy variables extracted from these categories are thus very sparse.
C. Analysis on Users
There are 55,503 users in the combined train and test data, out of which nearly 9,500 have no average star rating or vote counts; these correspond to 30% of the reviews in the test data. This is again an indicator of the sparsity of the data, making this problem mostly a cold start one. The median review count of the users is 7, which indicates that most users have written very few reviews.

Using this predictor we scored a Kaggle RMSE of 1.39.


V. SINGULAR VALUE DECOMPOSITION USING ALS
Singular Value Decomposition (SVD) is a latent factor method popularized by its massive success in the Netflix Prize competition. A matrix is approximated with SVD by multiplying two generated feature matrices P and Q of rank k:

R̂_ij = P_i Q_j^T + μ

Here R_ij corresponds to the rating for user i and business j. We then minimize the squared differences between the ratings matrix R and the SVD-approximated matrix R̂. We also subtract the global average rating, μ, to make the SVD prediction more accurate [1]:

arg min_{P,Q} Σ_{(i,j)∈R} (R_ij − P_i Q_j^T − μ)²

We also include a weighted lambda regularization (Tikhonov regularization) to avoid overfitting to our training set:

arg min_{P,Q} Σ_{(i,j)∈R} (R_ij − P_i Q_j^T − μ)² + λ(‖P_i‖² + ‖Q_j‖²)

Our best results for this method were found with a rank (k) of 100 and a regularization constant of 0.3. We could not run more than 25 iterations, as that is a limitation of the version of Spark we used. Our SVD with ALS method scored an RMSE of 1.2907.
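A toy NumPy version of this regularized alternating-least-squares factorization, for illustration only (the actual model used Spark mllib's ALS with k = 100 and λ = 0.3; the matrix below is made up):

```python
import numpy as np

np.random.seed(0)
R = np.array([[5., 4., 1.], [4., 5., 2.], [1., 2., 5.]])  # toy ratings
mu = R.mean()
Rc = R - mu                      # subtract the global mean, as in the text
k, lam = 2, 0.1                  # rank and regularization for the toy
P = 0.1 * np.random.randn(3, k)  # user factors
Q = 0.1 * np.random.randn(3, k)  # business factors

def loss():
    # regularized squared-error objective from the section above
    return np.sum((Rc - P @ Q.T) ** 2) + lam * (np.sum(P**2) + np.sum(Q**2))

for _ in range(20):              # alternate closed-form ridge solves
    P = Rc @ Q @ np.linalg.inv(Q.T @ Q + lam * np.eye(k))
    Q = Rc.T @ P @ np.linalg.inv(P.T @ P + lam * np.eye(k))

R_hat = mu + P @ Q.T             # add the mean back for predictions
```

Each half-step is a ridge regression with a closed-form solution, which is what makes ALS easy to parallelize in Spark.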

TABLE II
FEATURES EXTRACTED FROM TRAINING DATA SETS

DATASET    FEATURES              DESCRIPTION
Review     Votes                 Cool, Funny, Useful
Business   Categories            Categories assigned to business
Business   Longitude             Longitude of business
Business   Latitude              Latitude of business
Business   Average Stars         Average star rating of business
Business   Review count          Number of reviews received
Business   Open                  If the business is in operation
Business   Zip Code              Zip code of the business
User       Review count          Number of reviews written
User       Average Stars         Average star rating by user
User       Gender                Gender of the user
Check-ins  Weekday early mrng.   Check-ins between 12AM-7AM
Check-ins  Weekday morning       Check-ins between 7AM-11AM
Check-ins  Weekday midday        Check-ins between 11AM-2PM
Check-ins  Weekday afternoon     Check-ins between 2PM-5PM
Check-ins  Weekday evening       Check-ins between 5PM-9PM
Check-ins  Weekday night         Check-ins between 9PM-12AM
Check-ins  Weekend early mrng.   Check-ins between 12AM-7AM
Check-ins  Weekend morning       Check-ins between 7AM-11AM
Check-ins  Weekend midday        Check-ins between 11AM-2PM
Check-ins  Weekend afternoon     Check-ins between 2PM-5PM
Check-ins  Weekend evening       Check-ins between 5PM-9PM
Check-ins  Weekend night         Check-ins between 9PM-12AM


Votes - Votes categorized as Useful, Cool and Funny were available in the Review data set. They were assigned to users and businesses to provide additional features for prediction.

A. Clustering of Geographic Location
The location of a business can highly influence its popularity. Location attributes such as crime rate, population, etc. remain constant for businesses in the same area. This was the major reason to cluster businesses belonging to the same neighborhood together, using their geographical coordinates (latitude and longitude) and the zip codes given in the business data. Businesses were clustered by running K-means on both features. To determine the number of clusters (K), we created a scree plot of K against the mean distance to centroid.
Fig. 3. K-Means Scree Plot


Fig. 2. Time Series Plot of Total Check-Ins

Since both features provide essentially the same location information, we chose the one with the best 3-fold cross validation results in the models. From the scree plot, the best number of clusters to use was 7 for the geographical coordinates.
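The scree-plot procedure can be sketched as follows (a minimal NumPy K-means on synthetic coordinates, since the Yelp data is not reproduced here):

```python
import numpy as np

np.random.seed(1)
# synthetic (latitude, longitude) points around three made-up centers
pts = np.vstack([np.random.randn(50, 2) * 0.1 + c
                 for c in [(33.4, -112.0), (33.6, -111.9), (33.3, -111.7)]])

def kmeans_mean_dist(X, k, iters=25):
    """Run plain K-means and return the mean distance to assigned centroid."""
    centroids = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return np.linalg.norm(X - centroids[labels], axis=1).mean()

# scree curve: mean distance to centroid for each candidate K
scree = {k: kmeans_mean_dist(pts, k) for k in range(2, 8)}
```

The "elbow" of this curve is what suggested K = 7 neighborhood clusters in the paper.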
B. Street Direction

VI. FEATURE ENGINEERING
The features we used are summarized in Table II. In addition to the features directly available in the data sets, we created new features from the data. Prominent examples include:
Business categories - A business can belong to multiple
categories. Hence, dummy variables were created for
each of the categories.
Check-Ins - Hourly check-in volumes for businesses were grouped into broader categories to obtain meaningful volumes. Total check-ins are significantly higher for businesses with average ratings between 3 and 4 than for businesses with extreme ratings. In the absence of average business stars for sizeable test cases, this was a useful feature [2].
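The grouping of raw hourly check-ins into these day-part buckets might look like this (bucket boundaries follow Table II; day indices follow the check-in JSON in the Appendix, where day 0 is Sunday):

```python
from collections import Counter

def bucket(hour, day):
    # day parts per Table II: early morning 12AM-7AM, morning 7AM-11AM,
    # midday 11AM-2PM, afternoon 2PM-5PM, evening 5PM-9PM, night 9PM-12AM
    part = ("early_mrng" if hour < 7 else "morning" if hour < 11 else
            "midday" if hour < 14 else "afternoon" if hour < 17 else
            "evening" if hour < 21 else "night")
    prefix = "weekend" if day in (0, 6) else "weekday"  # Sunday=0, Saturday=6
    return f"{prefix}_{part}"

def group_checkins(checkin_info):
    """Aggregate raw 'hour-day' check-in counts into day-part buckets."""
    out = Counter()
    for key, count in checkin_info.items():
        hour, day = map(int, key.split("-"))
        out[bucket(hour, day)] += count
    return dict(out)
```

For example, `{"0-0": 3, "14-4": 5, "23-6": 2}` lands in the weekend early-morning, weekday afternoon, and weekend night buckets, respectively.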

Street direction can give a sense of the impact of the neighborhood/locality attributes of a business on its popularity. With this hypothesis, the direction of the street for each business was extracted from its address as N, S, E, or W.
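A minimal sketch of the extraction (address formats vary, so the pattern below is illustrative, not the paper's exact parser):

```python
import re

def street_direction(address):
    """Return a standalone N/S/E/W token from an address, if present."""
    m = re.search(r"\b([NSEW])\b", address)
    return m.group(1) if m else None
```

The word boundaries keep single letters inside words (e.g. the "E" in "Ave") from matching.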
C. User Gender
Though our dataset does not contain user gender, we can infer the gender of a user from their first name. We used a publicly hosted dataset which maps common names to gender, and we were able to map most of the names in our dataset [1].
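A sketch of the lookup (the name-to-gender table below is a tiny hypothetical stand-in for the public dataset the paper used):

```python
# hypothetical excerpt of a name-to-gender mapping
NAME_GENDER = {"john": "M", "mary": "F"}

def infer_gender(first_name):
    """Case-insensitive lookup; unmapped names fall back to 'unknown'."""
    return NAME_GENDER.get(first_name.strip().lower(), "unknown")
```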
D. Grouped Category Averages
Each business belonged to multiple categories. Out of
all the categories tagged to a business, there might be a
category/department hosted by a business which is more

VII. MACHINE LEARNING
Various machine learning methods were tried on the data to extract information from the metadata, which is otherwise not possible using techniques like matrix factorization and collaborative filtering.
A. Segmentation of Test Review Data
Five segments of the test Review data were created based on the availability of user and business information. This was done to capture the maximum number of features for predicting ratings in each segment and improve overall accuracy. For instance, Segment 1 consists of those test reviews whose user and business information is available in the respective training data sets. Since the training data sets have more feature information available, this segment had the maximum features available for modeling; hence, its MSE was also the lowest for all the methods. On the other hand, Segment 5 consists of those reviews whose user information is unavailable and whose business information is available only in the test Business data. The five segments along with their sizes are listed in Table III.
TABLE III
TEST REVIEW DATA SET SEGMENTS

SEGMENT     DESCRIPTION                                                 SIZE
Segment 1   Both User and Business exist in respective Training sets    12,078
Segment 2   Only Business exists in respective Training set             14,951
Segment 3   Only User exists in respective Training set                  4,086
Segment 4   Both User and Business exist in respective Test sets         4,767
Segment 5   Only Business exists in respective Test set                    522
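The routing of a test review into one of the five segments can be sketched as follows (the membership sets below are stand-ins for the actual training/test metadata; the logic follows the segment descriptions in Table III):

```python
def assign_segment(user_id, biz_id, train_users, train_biz, test_users, test_biz):
    u_train, b_train = user_id in train_users, biz_id in train_biz
    if u_train and b_train:
        return 1  # both in the training sets
    if b_train:
        return 2  # only the business is in training
    if u_train:
        return 3  # only the user is in training
    if user_id in test_users and biz_id in test_biz:
        return 4  # both appear only in the test metadata
    if biz_id in test_biz:
        return 5  # business only in test metadata, user unknown
    return None   # no metadata at all

train_users, train_biz = {"u1"}, {"b1"}
test_users, test_biz = {"u2"}, {"b2"}
```

A separate model is then trained per segment on whatever features that segment actually has.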

Reduction in MSE was primarily achieved by varying the depth of the trees once the number of trees reached close to 15.
The cross-validation results for each segment can be seen in Figure 4. The MSEs achieved are significantly different for each segment, which justifies the use of the segmented approach. As expected, Segment 1, with the maximum available features, has the lowest MSE of 0.93. Segments 2 and 3 perform equally well with an MSE of 1.18. Segments 1, 2, and 3 achieve their best MSE at a depth of 10, after which the MSE starts to increase due to over-fitting. Segments 4 and 5 require deeper trees to achieve better accuracy due to the lack of significant features.
The overall random forest regressor built using the tuned parameters resulted in an RMSE of 1.267 on the test data set, which is better than the 1.39 achieved through the baseline model. The RandomForest regressor from the mllib library in Spark was used for modeling purposes.
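The per-segment 3-fold cross-validation loop can be sketched generically (the toy mean-predictor below stands in for the Spark mllib RandomForest calls actually used; the data is made up):

```python
import random
import statistics

def three_fold_mse(rows, fit, predict):
    """Shuffle rows into 3 folds and average the held-out MSE."""
    random.seed(0)
    rows = rows[:]
    random.shuffle(rows)
    folds = [rows[i::3] for i in range(3)]
    errs = []
    for i in range(3):
        test = folds[i]
        train = [r for j in range(3) if j != i for r in folds[j]]
        model = fit(train)
        errs.append(statistics.mean((predict(model, x) - y) ** 2 for x, y in test))
    return statistics.mean(errs)

# toy stand-in model: predict the training-mean rating
data = [((1,), 4.0), ((2,), 5.0), ((3,), 3.0), ((4,), 4.0), ((5,), 5.0), ((6,), 2.0)]
fit = lambda train: statistics.mean(y for _, y in train)
predict = lambda model, x: model
cv_mse = three_fold_mse(data, fit, predict)
```

In the actual pipeline, `fit` would train a forest for a given (number of trees, max depth) pair, and the grid point with the lowest `cv_mse` would be kept per segment.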


popular than the others. To account for this, we computed the grouped category average. For example:
Business1 → [Food, Ice Cream and Frozen Yogurt]
Grouped average → average(mean rating for Food, mean rating for Ice Cream and Frozen Yogurt)
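A sketch of the computation (the category means below are made up for illustration; the fallback to the global mean for unseen categories is an assumption, not stated in the paper):

```python
import statistics

# hypothetical per-category mean ratings from the training data
category_mean = {"Food": 3.8, "Ice Cream & Frozen Yogurt": 4.1}

def grouped_average(categories, cat_means, global_mean=3.7667):
    """Average the per-category mean ratings over a business's tags."""
    means = [cat_means[c] for c in categories if c in cat_means]
    return statistics.mean(means) if means else global_mean
```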


Fig. 4. Cross Validation - Random Forest Models

B. Random Forest Regression
A random forest (RF) operates by constructing a multitude of uncorrelated decision trees at training time and outputting the mean predicted rating (in this case) of the individual trees.
The initial prototype of the RF was developed on the complete training data set, but its MSE was only slightly better than the Global Mean Baseline, since plenty of features had to be excluded due to their unavailability for certain test cases. Post segmentation, a different RF model was built for each segment. For building each tree within a model, a random subset (m) of the maximum predictors (p) was considered; following best practice, m = √p was used. 3-fold cross validation was performed for each segment to tune the number of trees and the maximum depth of each tree. The other parameters were kept at their default settings.

The features had varying impacts on the reduction in MSE for each segment. Figure 5 shows the impact of sequentially including User, Business, and extracted features on the MSE of Segment 1.
C. Gradient Boosted Trees
Gradient Boosted Trees (GBT) operate by iteratively building a sequence of predictor trees, where the final predictor is a weighted average of these trees. At each step, the focus is on adding an incremental tree that improves the performance of the entire ensemble. Compared to RF, it is reasonable to use smaller trees in GBT, with just a few terminal nodes.
As with RF, a different model was built for each segment. 3-fold cross validation was performed for each segment to tune the loss type (log-loss and least squares), the number of iterations/trees, and the maximum depth of trees. Reduction in MSE was primarily achieved by using least squares error and by varying the depth of trees.
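The residual-fitting idea behind GBT can be illustrated with a tiny single-feature booster using decision stumps (a toy sketch, not Spark's GradientBoostedTrees; the data is made up):

```python
def fit_stump(xs, residuals):
    """Find the single-split stump minimizing squared error on residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left) +
               sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]  # (threshold, left_value, right_value)

def boost(xs, ys, rounds=10, lr=0.5):
    """Each round fits a stump to the current residuals (least squares)."""
    base = sum(ys) / len(ys)
    pred = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]
        t, lm, rm = fit_stump(xs, resid)
        stumps.append((t, lm, rm))
        pred = [p + lr * (lm if x <= t else rm) for x, p in zip(xs, pred)]
    return base, stumps, pred

xs = [1, 2, 3, 4, 5]
ys = [2.0, 2.0, 4.0, 5.0, 5.0]
base, stumps, train_pred = boost(xs, ys)
```

Because each stump depends on the previous ensemble's residuals, training is inherently sequential, which is why GBT trained slower than RF in the paper's setup.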


The various parameters, namely the learning rate and number of iterations for stochastic gradient descent and the choice of L1 (Lasso) vs. L2 (Ridge) regularization, were selected based on 3-fold cross validation results.
Yelp rating predictions for the test reviews using this linear regression with stochastic gradient descent resulted in an RMSE of 1.29. The results were quite inferior to the other methods. Although model training was very fast, tuning the parameters consumed a lot of time.
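A toy version of SGD training for linear regression (noiseless one-feature data, made up for illustration; the actual model was mllib's LinearRegressionWithSGD):

```python
import random

random.seed(0)
data = [(x, 0.5 * x + 1.0) for x in range(10)]  # exact linear relation

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    x, y = random.choice(data)    # one random example per step
    err = (w * x + b) - y
    w -= lr * err * x             # gradient of squared error w.r.t. w
    b -= lr * err                 # gradient of squared error w.r.t. b
```

The learning rate and step count here correspond to the parameters tuned by cross validation in the paper.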
TABLE IV
SELECTION OF BEST MODEL

Method                 RMSE
Random Forest          1.267
Gradient Boosting      1.278
Matrix Factorisation   1.2907
Logistic Regression    1.298
Linear Regression      1.29
Global Mean Baseline   1.390

Fig. 5. Features Impact Waterfall - Random Forest - Segment 1. Starting from the baseline MSE (mean) of 1.48, sequentially including the original User features, the original Business features, and the extracted features brings Segment 1 to a final MSE of 0.93.

The overall RMSE achieved through GBT was 1.28 on the test data set, which is better than the 1.39 achieved through the baseline model but slightly inferior to RF's RMSE. This was probably because GBTs are more prone to overfitting. Also, GBT took much longer to train than RF, because GBT trains one tree at a time while RF can train multiple trees in parallel. GradientBoostedTrees from the mllib library in Spark was used to build the models.

VIII. CHALLENGES

D. Logistic Regression
For this multi-class classification problem, logistic regression was adapted by building multiple one-vs-all binary classifiers, one for each of the 5 classes (i.e., the 5 ratings). Each test observation was assigned the rating with the highest predicted probability among the binary classifiers.
3-fold cross-validation was performed to tune the number of iterations and to select the optimal regularization type (L1/Lasso vs. L2/Ridge). The optimal regularization was L2 for Segments 1, 2, and 3, and L1 for the remaining segments.
Using this logistic regression classifier we got a Kaggle RMSE of 1.30. The results were quite inferior to the other methods. One limitation of applying logistic regression in Spark was that the predictions were returned as classes rather than raw probabilities, which could have inflated the RMSE. The run time for training the classifier, though, was significantly lower than for the other methods. LogisticRegressionWithLBFGS from the mllib library in Spark was used; L-BFGS is an optimization algorithm that operates using a limited amount of memory.
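The one-vs-all decision rule, and the probability-weighted alternative that hard class outputs preclude, can be sketched as follows (the probabilities below are made-up illustrations, not outputs of trained models):

```python
def ovr_predict(class_probs):
    """Assign the class whose binary classifier gives the highest probability."""
    return max(class_probs, key=class_probs.get)

def expected_rating(class_probs):
    """Probability-weighted rating; softens hard class predictions."""
    total = sum(class_probs.values())
    return sum(k * p for k, p in class_probs.items()) / total

probs = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.40, 5: 0.25}
```

With raw probabilities, `expected_rating` would yield fractional predictions (here 3.7 rather than a hard 4), which is the kind of smoothing the text suggests could have lowered the RMSE had Spark exposed probabilities.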
E. Linear Regression
Predicting ratings can also be cast as a regression problem, and hence linear regression was attempted in Spark to predict the ratings. In the mllib library in Spark, the linear regression model was built using stochastic gradient descent via LinearRegressionWithSGD.

The unavailability of Python bindings in Spark for a few of the machine learning techniques we wanted to use for feature engineering, such as Principal Component Analysis for reducing the dimensionality of the categories and hierarchical clustering for grouping businesses based on location, posed a significant problem. Also, since Spark is not as established as Python, it was occasionally a challenge to solve problems due to limited online resources.
We attempted Collaborative Filtering on the data, but since the ratings are very sparse, i.e., most users have only 1 or 2 businesses in common, the similarity mostly comes out to be 1 irrespective of the magnitude of the ratings. Also, since this algorithm is O(n²), it was not possible to implement it in a scalable way.
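The similarity degeneracy is easy to see: with a single co-rated business, each user's rating vector (restricted to shared items) has one element, and the cosine similarity of one-element vectors is always 1:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

So two users who rated one shared business 5 and 1 still come out perfectly similar, while users with two or more shared businesses can be told apart.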
IX. CONCLUSIONS

Machine learning techniques such as Random Forest work really well for sparse data in recommendation problems if we can leverage the metadata and extract information from it. Hence, we need not rely only on the global mean to predict ratings in cases where either the user or the business is unknown. By using features extracted from the data, ratings for these segments can be predicted with higher accuracy than with the global mean.
X. FUTURE WORK
A personalized recommendation system can be built using the final model. Personalization can be based on user history, the category of business the user is interested in, and the zip code/location the user is interested in. We have created a prototype using a Python package called spyre, which allows us to create an interface where one can choose a user name, a category, and a zip code. Based on these selection criteria, we recommend the business the user is most likely to rate highly given their past review history.

REFERENCES
[1] N. Carrillo, I. Elmaleh, R. Gallego, Z. Kloock, I. Ng, J. Perez, M. Schwinger, and R. Shiroma, "Recommender Systems Designed for Yelp.com."
[2] "Exploring the Yelp Data Set: Extracting Useful Features with Text Mining and Exploring Regression Techniques for Count Data."
[3] http://bryangregory.com/kaggle-data-science-competition-recap-recsys2013-yelp-business-rating-prediction/

APPENDIX
Training Data Format

Business
{
  "type": "business",
  "business_id": (encrypted business id),
  "name": (business name),
  "neighborhoods": [(neighborhood names)],
  "full_address": (local address),
  "city": (city),
  "state": (state),
  "latitude": (latitude),
  "longitude": (longitude),
  "stars": (star rating, rounded to half-stars),
  "review_count": (review count),
  "categories": [(localized category names)],
  "open": True / False (corresponds to permanently closed)
}

Review
{
  "type": "review",
  "business_id": (encrypted business id),
  "user_id": (encrypted user id),
  "stars": (star rating),
  "text": (review text),
  "date": (date, formatted like "2012-03-14", %Y-%m-%d in strptime notation),
  "votes": {"useful": (count), "funny": (count), "cool": (count)}
}

User
{
  "type": "user",
  "user_id": (encrypted user id),
  "name": (first name),
  "review_count": (review count),
  "average_stars": (floating point average, like 4.31),
  "votes": {"useful": (count), "funny": (count), "cool": (count)}
}

Check-In
{
  "type": "checkin",
  "business_id": (encrypted business id),
  "checkin_info": {
    "0-0": (number of checkins from 00:00 to 01:00 on all Sundays),
    "1-0": (number of checkins from 01:00 to 02:00 on all Sundays),
    ...
    "14-4": (number of checkins from 14:00 to 15:00 on all Thursdays),
    ...
    "23-6": (number of checkins from 23:00 to 00:00 on all Saturdays)
  }
}
