I. INTRODUCTION
With the explosion of social networking websites and online user-generated content platforms, there is an ever-increasing demand for personalization. This also presents a strategic opportunity for businesses to expand and enhance their offerings. Companies such as Amazon, Netflix, and Pandora prominently feature recommended, personalized content for their customers, enhancing their product experience in a significant way. We are motivated by this widespread application of recommendation systems across industries, and we would like to leverage this opportunity to gain hands-on experience in a rapidly evolving field. We chose data from a popular website, Yelp, as it is extensive and presents a challenging recommendation problem.
We aimed to create an elementary recommendation system for Yelp. We followed the Kaggle competition RecSys2013: Yelp Business Rating Prediction and learned from the best-performing algorithms in the competition. Using collective knowledge about users, businesses, and user ratings (indicators of user preference), we created a model to predict the rating that a Yelp user would give to a business. We validated our model on the test data provided in the competition. The most accurate user preferences were then used to recommend similar businesses to a user for a chosen category and zip code.
              Known Business   Unknown Business
Known User        33.2%            11.2%
Unknown User      14.1%            14.5%
This poses a potential threat to the success of a collaborative filtering approach, since it requires a decent user history for similarity calculations. Moreover, a large number of user pairs have only one business in common, which makes their similarity 1 regardless of what their ratings are.
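For illustration, a minimal sketch (not our actual pipeline code) of why a single co-rated business always yields a similarity of 1 under cosine similarity restricted to co-rated items:

```python
import math

def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two users, computed only over
    the businesses that both users have rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[b] * ratings_b[b] for b in common)
    norm_a = math.sqrt(sum(ratings_a[b] ** 2 for b in common))
    norm_b = math.sqrt(sum(ratings_b[b] ** 2 for b in common))
    return dot / (norm_a * norm_b)

# Two users share exactly one business and disagree strongly on it,
# yet their similarity is still 1.
u1 = {"biz_1": 5.0}
u2 = {"biz_1": 1.0, "biz_2": 4.0}
print(cosine_similarity(u1, u2))  # 1.0
```

With one common item the vectors are one-dimensional, so the angle between them is always zero; any similarity-based neighborhood built on such pairs is uninformative.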
The votes for each user in the useful, funny, and cool categories are skewed toward the useful category, which comprises nearly 50% of the total votes. In addition, there is strong multicollinearity between the vote categories; all three pairwise correlations are above 0.95. We were therefore better off eliminating or combining these variables rather than risking the presence of multicollinearity affecting models that are sensitive to it.
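The multicollinearity check and the combine-variables remedy can be sketched as follows (the vote counts here are synthetic stand-ins; the 0.95+ correlations come from the actual data):

```python
import numpy as np

# Synthetic vote counts per user: useful, funny, and cool votes rise
# together, mimicking the multicollinearity observed in the real data.
rng = np.random.default_rng(0)
useful = rng.poisson(20, size=1000).astype(float)
funny = 0.5 * useful + rng.normal(0, 0.3, size=1000)
cool = 0.6 * useful + rng.normal(0, 0.3, size=1000)

votes = np.column_stack([useful, funny, cool])
corr = np.corrcoef(votes, rowvar=False)
print(corr)

# When all pairwise correlations are very high, collapse the three
# columns into a single total-votes feature.
if corr[np.triu_indices(3, k=1)].min() > 0.95:
    total_votes = votes.sum(axis=1)
```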
Fig. 1. Distribution of review star ratings (counts per star value).
B. Analysis on Businesses
There are 14,334 businesses in the training and test
datasets combined. However, 2,797 businesses in the test
data have missing star rating information. These correspond
to almost 25% of reviews in the test data. This poses a severe
problem as we predict ratings for recommendations.
Most businesses have a rating of 3, 3.5, or 4. Low (1, 2) and high (5) ratings are very rare in the dataset. These lower ratings are said to be sandbag ratings, and they account for a good portion of the error in our predictions (Fig. 1).
Business categories pose a major difficulty, as there are 549 categories in total across the train and test data. Moreover, each business can be mapped to multiple categories. The number of categories associated with a business ranges from 0 to 10, and almost 65% of businesses have 3 or 4 categories associated with them. The dummy variables extracted from these categories are therefore very sparse.
C. Analysis on Users
There are 55,503 users in combined train and test data,
out of which nearly 9500 users have no average star ratings
and vote counts. These correspond to 30% of reviews in test
data. This is again an indicator of the sparsity of the data,
making this problem mostly a cold start one.
Median review count of the users is 7 which indicates
that most of the users have written very few reviews.
$$\underset{P,Q}{\arg\min} \sum_{(i,j) \in R} \left( R_{ij} - P_i^{T} Q_j \right)^2$$

$$\underset{P,Q}{\arg\min} \sum_{(i,j) \in R} \left( R_{ij} - P_i^{T} Q_j \right)^2 + \lambda \left( \lVert P_i \rVert^2 + \lVert Q_j \rVert^2 \right)$$
Our best results for this method used a rank (k) of 100 and a regularization constant of 0.3. We could not run more than 25 iterations, a limitation of the version of Spark we were using. Our SVD-with-ALS method scored an RMSE of 1.2907.
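For illustration, a minimal NumPy sketch of the alternating-least-squares updates implied by the regularized objective above (a toy ratings matrix, rank 2, and λ = 0.1; our actual implementation used Spark's ALS):

```python
import numpy as np

def als(R, mask, k=2, lam=0.1, iters=20, seed=0):
    """Alternating least squares on a ratings matrix R.
    mask[i, j] == 1 where a rating is observed."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    I = np.eye(k)
    for _ in range(iters):
        # Fix Q, solve a regularized least-squares problem per user.
        for i in range(n_users):
            obs = mask[i] == 1
            A = Q[obs].T @ Q[obs] + lam * I
            P[i] = np.linalg.solve(A, Q[obs].T @ R[i, obs])
        # Fix P, solve per item.
        for j in range(n_items):
            obs = mask[:, j] == 1
            A = P[obs].T @ P[obs] + lam * I
            Q[j] = np.linalg.solve(A, P[obs].T @ R[obs, j])
    return P, Q

# Toy 4x4 ratings matrix; zeros mark missing entries.
R = np.array([[5, 4, 0, 1], [4, 5, 1, 0], [1, 0, 5, 4], [0, 1, 4, 5]], float)
mask = (R > 0).astype(int)
P, Q = als(R, mask)
pred = P @ Q.T
rmse = np.sqrt(((pred - R)[mask == 1] ** 2).mean())
```

Each inner solve is a small k x k linear system, which is what makes ALS easy to parallelize across users and items in Spark.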
TABLE II
FEATURES EXTRACTED FROM TRAINING DATA SETS

DATASET     FEATURES               DESCRIPTION
Review      Votes                  Cool, Funny, Useful
Business    Categories             Categories assigned to business
Business    Longitude              Longitude of business
Business    Latitude               Latitude of business
Business    Average Stars          Average star rating of business
Business    Review count           Number of reviews received
Business    Open                   If the business is in operation
Business    Zip Code               Zip code of the business
User        Review count           Number of reviews written
User        Average Stars          Average star rating by user
User        Gender                 Gender of the user
Check-ins   Weekday early mrng.    Check-ins between 12AM-7AM
Check-ins   Weekday morning        Check-ins between 7AM-11AM
Check-ins   Weekday midday         Check-ins between 11AM-2PM
Check-ins   Weekday afternoon      Check-ins between 2PM-5PM
Check-ins   Weekday evening        Check-ins between 5PM-9PM
Check-ins   Weekday night          Check-ins between 9PM-12AM
Check-ins   Weekend early mrng.    Check-ins between 12AM-7AM
Check-ins   Weekend morning        Check-ins between 7AM-11AM
Check-ins   Weekend midday         Check-ins between 11AM-2PM
Check-ins   Weekend afternoon      Check-ins between 2PM-5PM
Check-ins   Weekend evening        Check-ins between 5PM-9PM
Check-ins   Weekend night          Check-ins between 9PM-12AM
Fig. 2. Distribution of the number of check-ins, by business rating (4.0, 4.5, 5.0).

Fig. 3. Mean distance to centroid vs. number of clusters (scree plot).
Since both features provide essentially the same location information, we chose the one with the best 3-fold cross-validation results with the models. From the scree plot (Fig. 3), the best number of clusters for the geographical coordinates was 7.
B. Street Direction
Street direction can give a sense of the impact of the neighborhood/locality attributes of a business on its popularity.
With this hypothesis, the direction of the street for business
was extracted from its address as N, S, E and W.
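The extraction can be sketched as a simple pattern match (the pattern below is illustrative; real Yelp addresses may need more normalization):

```python
import re

def street_direction(address):
    """Return 'N', 'S', 'E', or 'W' if the address contains a cardinal
    street-direction token, else None."""
    match = re.search(r"\b(N|S|E|W|North|South|East|West)\b",
                      address, flags=re.IGNORECASE)
    if not match:
        return None
    return match.group(1)[0].upper()

print(street_direction("4425 N 24th St, Phoenix, AZ"))   # N
print(street_direction("1010 W Washington St, Tempe"))   # W
print(street_direction("200 Main St, Scottsdale"))       # None
```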
C. User Gender
Though our dataset does not contain user gender, we can
simply guess the gender of the user from their first name. We
used a publicly hosted dataset which maps common names
to gender. We were able to map most of the names in our
dataset[1] .
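The mapping amounts to a lookup against the public names table; a toy version follows (the dictionary here is a small stand-in for the real dataset, which covers many thousands of common first names):

```python
# Stand-in for the public name->gender table used in the paper.
NAME_GENDER = {"john": "M", "mary": "F", "james": "M", "linda": "F"}

def infer_gender(first_name):
    """Return 'M', 'F', or 'unknown' for names missing from the table."""
    return NAME_GENDER.get(first_name.strip().lower(), "unknown")

print(infer_gender("Mary"))     # F
print(infer_gender("Xochitl"))  # unknown
```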
D. Grouped Category Averages
Each business belonged to multiple categories. Out of
all the categories tagged to a business, there might be a
category/department hosted by a business which is more
DESCRIPTION                                                    SIZE
Both, User and Business, exist in respective Training sets    12,078
Only Business exists in respective Training set               14,951
Only User exists in respective Training set                    4,086
Both, User and Business, exist in respective Test sets         4,767
Only Business exists in respective Test set                      522

Fig. 4. MSE (mean squared error) for Segments 1-5.
per best practice, m = √p was used (where p is the number of features). 3-fold cross-validation was performed for each segment to tune the number of trees and the maximum depth of each tree. The other parameters were kept at their default settings. The resulting reduction in MSE is shown in Fig. 5.

Fig. 5. MSE reduction for Segment 1: from a baseline (mean) MSE of 1.48 to a final MSE of 0.93, after successively including the original User features, the original Business features, and the extracted features.
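The per-segment tuning described above can be sketched with scikit-learn (we used Spark's MLlib in practice; the data here is synthetic, and m = √p corresponds to max_features="sqrt"):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for one segment's feature matrix and star ratings.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))
y = np.clip(3 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.3, 300), 1, 5)

# m = sqrt(p) features per split; tune tree count and depth by 3-fold CV.
grid = GridSearchCV(
    RandomForestRegressor(max_features="sqrt", random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10]},
    cv=3,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_)
```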
VIII. CHALLENGES
D. Logistic Regression
For this multi-class classification problem, logistic regression was adapted by building a One-vs-All binary classifier for each of the 5 classes, i.e., the 5 ratings. Test observations were assigned the rating with the highest predicted probability across the binary classifiers.
3-fold cross-validation was performed to tune the number of iterations and to select the optimal regularization penalty: L1 (Lasso) or L2 (Ridge). The optimal penalty was L2 for Segments 1, 2, and 3, and L1 for the remaining segments.
Using this logistic regression classifier we got a Kaggle RMSE of 1.30. The results were quite inferior to the other methods. One limitation of applying logistic regression in Spark was that the predictions came back as hard class labels rather than raw probabilities, which could have inflated the RMSE. The run time for training the classifier, though, was significantly lower than for the other methods. We used LogisticRegressionWithLBFGS from Spark's mllib library; L-BFGS is a quasi-Newton optimization algorithm that operates using a limited amount of memory.
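The One-vs-All scheme can be sketched in NumPy (a toy batch-gradient version on synthetic two-feature data; in practice we used LogisticRegressionWithLBFGS with the L1/L2 penalties tuned as above):

```python
import numpy as np

def train_ova(X, y, classes, lr=0.5, epochs=500):
    """Train one binary logistic classifier per rating class."""
    weights = {}
    Xb = np.hstack([X, np.ones((len(X), 1))])  # add bias column
    for c in classes:
        t = (y == c).astype(float)
        w = np.zeros(Xb.shape[1])
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-Xb @ w))
            w -= lr * Xb.T @ (p - t) / len(X)
        weights[c] = w
    return weights

def predict_ova(X, weights):
    """Assign each row the class with the highest predicted probability."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    classes = sorted(weights)
    probs = np.column_stack(
        [1.0 / (1.0 + np.exp(-Xb @ weights[c])) for c in classes])
    return np.array(classes)[probs.argmax(axis=1)]

# Toy data: each rating class sits at a distinct point on the unit circle,
# so every one-vs-rest problem is linearly separable.
rng = np.random.default_rng(0)
y = rng.integers(1, 6, size=500)
angles = 2 * np.pi * y / 5
X = np.column_stack([np.cos(angles), np.sin(angles)])
X += rng.normal(0, 0.1, X.shape)

w = train_ova(X, y, classes=[1, 2, 3, 4, 5])
pred = predict_ova(X, w)
accuracy = (pred == y).mean()
```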
E. Linear Regression
Predicting ratings can also be cast as a regression problem, and hence linear regression was attempted in Spark to predict the ratings. In Spark's mllib library, the linear regression model was built with stochastic gradient descent using LinearRegressionWithSGD. The various parameters were again tuned with 3-fold cross-validation.
RMSE: 1.267, 1.278, 1.2907, 1.298, 1.29, 1.390
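A minimal NumPy analogue of the SGD-based linear regression (toy data; Spark's LinearRegressionWithSGD performs the same per-example updates at scale):

```python
import numpy as np

def linreg_sgd(X, y, lr=0.01, epochs=50, seed=0):
    """Plain stochastic gradient descent for least-squares regression."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])  # bias term
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(Xb)):
            err = Xb[i] @ w - y[i]
            w -= lr * err * Xb[i]
    return w

# Toy regression problem with known coefficients and intercept 3.0.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 3.0 + rng.normal(0, 0.1, 400)

w = linreg_sgd(X, y)
rmse = np.sqrt(((np.hstack([X, np.ones((400, 1))]) @ w - y) ** 2).mean())
```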
REFERENCES
[1] Naomi Carrillo, Idan Elmaleh, Rheanna Gallego, Zack Kloock, Irene
Ng, Jocelyne Perez, Michael Schwinger, Ryan Shiroma, Recommender
Systems Designed for Yelp.com.
[2] Exploring the Yelp Data Set: Extracting Useful Features with Text
Mining and Exploring Regression Techniques for Count Data.
[3] http://bryangregory.com/kaggle-data-science-competition-recaprecsys2013-yelp-business-rating-prediction/
APPENDIX
Training Data Format
Business
{
type: business,
business_id: (encrypted
business id),
name: (business name),
neighborhoods: [(neighborhood
names)],
full_address: (local address),
city: (city),
state: (state),
latitude: latitude,
longitude: longitude,
stars: (star rating, rounded to
half-stars),
review_count: review count,
categories: [(localized category
names)]
}
Review
{
type: review,
business_id: (encrypted
business id),
user_id: (encrypted user id),
stars: (star rating),
text: (review text),
date: (date, formatted
like 2012-03-14, %Y-%m-%d in
strptime notation),
votes: {useful: (count),
funny: (count), cool: (count)}
}
User
{
type: user,
user_id: (encrypted user id),
name: (first name),
review_count: (review count),
average_stars: (floating point
average, like 4.31),
votes: {useful: (count),
funny: (count), cool: (count)}
}
Check-In
{
type: checkin,
business_id: (encrypted business id),
checkin_info: {
0-0: (number of checkins from
00:00 to 01:00 on all Sundays),
1-0: (number of checkins from
01:00 to 02:00 on all Sundays),
...
14-4: (number of checkins from
14:00 to 15:00 on all Thursdays),
...
23-6: (number of checkins from
23:00 to 00:00 on all Saturdays)
}
}