
Yelp Business Rating Prediction

Rahul Venkataraman (rvenkatarama@cs.stonybrook.edu)
Tamilmani Manoharan (tmanoharan@cs.stonybrook.edu)
Venkatakrishnan Rajagopalan (verajagopala@cs.stonybrook.edu)
Student IDs: 110385545, 110368788, 110455765

ABSTRACT

Yelp has an enormous amount of data on a variety of local businesses, but Yelp only provides an aggregated rating for each business. The rich information about businesses, users, and reviews can be used to predict how a restaurant's reviews have been changing, how various features of a business affect its ratings, how the behaviour of users impacts the ratings, and so on. There are various studies on how the features of a business affect its ratings and on how much impact review text has on the ratings a business receives. In this project, we investigate various features of the Yelp data to build models for business rating prediction. We consider both these features and the review text to generate models for business rating prediction.


Keywords
Yelp, Ratings, Prediction, Feature Extraction, Linear
Regression, NLP

1. INTRODUCTION

Yelp, Inc. is a company that operates a social networking, user review, and local search website of the same name. Over 31 million people access Yelp's website each month, putting it in the top 150 of U.S. Internet websites. The company also trains small businesses in how to respond to reviews, hosts social events for reviewers, and provides data about businesses, including health inspection scores.
The goal of our project is to predict the ratings of various businesses given their features, their reviews, and data about the users who wrote the reviews. We have approached this problem in two different ways:
i) Predict the ratings of different businesses from their given aggregated features and the features we extracted for each business by combining user and review data with the business data.
ii) Learn how the ratings of a particular business have been changing over time and predict how they will change in the future.

2. PRIOR WORK

In the paper "Semantic Feature Analysis and Mining for Yelp Rating Prediction" [3], the author predicts review ratings from features such as the average business and user rating, the review count, and the number of votes. The average rating was initially taken as the only feature and the train and test MSE were calculated; the MSE was reduced when the review count and vote count were added alongside the average rating. Topic modeling was also performed and scores were calculated for each review; combining this feature with those mentioned above minimized the MSE further. We found this idea meaningful and planned to integrate it into our prediction model.

In the paper "Inferring Future Business Attention" [1], the authors take the review text into account to predict the future ratings a business will receive, using sentiment analysis and keyword-opinion extraction. We planned to follow the sentiment analysis approach to generate feature vectors from the review text in order to predict review ratings. In the paper "Data Mining Yelp Data - Predicting Rating Stars from Review Text" [4], the authors use Latent Dirichlet Allocation (LDA) and Term Frequency-Inverse Document Frequency (TF-IDF) to predict the rating. In addition to this model, we planned to combine the text features with the other features explained above, and to extract features using LDA as well.

3. DATA DESCRIPTION

Yelp provides a snapshot of its enormous amount of data through the Yelp Dataset Challenge. This set includes information about local businesses in 10 cities across 4 countries. The dataset is split into 5 .json files, each one representing a different category: business data, checkin data, user data, review data, and tips provided by users to improve the businesses.
All user-related data carry a unique user id, business data carry a business id, and the other data such as checkins, reviews, and tips connect a user and a business using these unique ids. The overall size of the data is around 1.64GB.

For our project we have used three datasets, namely yelp_academic_dataset_review.json, yelp_academic_dataset_user.json, and yelp_academic_dataset_business.json, which are the Review, User, and Business datasets respectively. Some details about the datasets are as follows:

Table 1: Dataset Description

Dataset   Rows       Size
User      366,715    166.2MB
Business  61,184     55.4MB
Review    1,578,264  1.4GB

The review table includes information about each review. Specifically, it contains business id, user id, stars (a star rating on a scale of 1-5), text (the raw review text), date, and votes (the number of useful, funny, or cool votes). The user table consists of user id, name, review count, average stars (the average rating on a scale of 1-5 given by the user), and votes (the total number of votes for reviews made by this user). The business table contains details about each business, including its id, name, neighborhoods, address and geographic information, stars, review count (the total number of reviews about this business), categories (a list of category tags for this business), and other attributes.

Figure 1: Rating Stars distribution


The plot of stars vs. counts shown above indicates that more than half of the reviews are positive, i.e., they are rated 4 or 5. Even the third highest count is for 3 stars, which may indicate that users tend to rate a place only if they really like it.

Data Preparation and Cleaning


We used an open source application from GitHub called jsontocsv that converts a .json file to a .csv file. The application also converts nested attributes into separate columns in the csv.
Also, when the feature set did not include the review text, we separated the text attribute from the review dataset to reduce its size and to improve execution time. We extracted only the features we wanted and combined them with the features extracted from the review text, which minimized the load on the machine and gave faster execution times.
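A minimal sketch of this conversion step with pandas (an alternative to the jsontocsv application, not the tool itself; the file names are illustrative):

    import pandas as pd

    # The Yelp files are line-delimited JSON, so pandas can read them directly.
    reviews = pd.read_json("yelp_academic_dataset_review.json", lines=True)

    # Drop the heavy review text when it is not part of the feature set.
    reviews.drop(columns=["text"]).to_csv("review_no_text.csv", index=False)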

4. EXPLORATORY DATA ANALYSIS

Since there are a lot of businesses with very few reviews, we planned to work on data that are heavily reviewed. In order to get rich data with sufficient attributes for predicting business ratings, EDA was performed to find the categories with the most reviews and to extract only the heavily reviewed data. As explained in the problem statement, a business with insufficient reviews will not be helpful in any analysis.

Figure 2: Businesses of cities distribution


Very few categories appear more than a thousand times in the business data. We can see that the restaurant category is present in more than 20k businesses, which accounts for a third of the business data. Other categories are present in much smaller numbers, so to enrich the data, as a preliminary step we took only the data under the Restaurant category. After EDA, the data was spliced and refined to keep only the most reviewed and useful content, in order to minimize outliers and to perform the prediction analysis effectively. A sketch of this filtering step is shown below.

5. POLYNOMIAL FIT OF REVIEW STARS

After EDA we found that 3 businesses were reviewed more than 1000 times in the Review dataset. For these 3 businesses only, we sorted the reviews by increasing time and tried to apply a fit to see whether the reviews followed any common pattern; a fitting sketch is shown below.
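A minimal sketch of these fits using numpy's polyfit and scikit-learn's r2_score; the ratings array is a toy stand-in for one business's reviews sorted by date:

    import numpy as np
    from sklearn.metrics import r2_score

    # Toy stand-in for one heavily reviewed business, sorted by review date.
    stars = np.array([4, 5, 3, 4, 2, 5, 4, 3, 5, 4], dtype=float)
    x = np.arange(len(stars))  # review index in chronological order

    for degree in (1, 2, 4, 6):
        coefficients = np.polyfit(x, stars, degree)
        fitted = np.polyval(coefficients, x)
        print("degree %d: R^2 = %.4f" % (degree, r2_score(stars, fitted)))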

Figure 3: Highest reviewed business
Figure 4: 2nd highest reviewed business
Figure 5: 3rd highest reviewed business

Table 2: Mean Squared Error

       Business 1  Business 2  Business 3
MSE    1.2         1.1         0.87

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination or, for multiple regression, the coefficient of multiple determination:

R-squared = Explained variation / Total variation

R-squared always lies between 0 and 100%. In general, the higher the R-squared, the better the model fits the data.

Table 3: R^2 values for various fits

Fit             Business 1  Business 2  Business 3
Linear          0.0099      0.0007      0.00065
Quadratic       0.012       0.0073      0.00065
Polynomial 4th  0.015       0.013       0.0048
Polynomial 6th  0.022       0.014       0.0067

This means that none of the models fit our data. We need more features or a different model to predict the future review stars a business might obtain.

6. LINEAR REGRESSION ON FEATURES

In this section, we investigate how each feature of users, businesses, and reviews influences the rating stars of a business.
Linear regression models the target Y as a linear function of the feature variables X_i, with a bias term (α) and a noise term (ε):

Y = α + Σ_i w_i X_i + ε

The coefficients w_i are what the training procedure learns. Each model coefficient describes the expected weight of influence on the target variable associated with its feature.
Intuitively, the coefficients often tell an interesting story of how much each feature matters in predicting target values, and the bias term indicates the average target value.
For example, for business ratings on Yelp, the value of a coefficient shows the strength of the feature and its sign (positive or negative) indicates the direction of its association with the final rating.


Exploratory Rating Prediction


We aggregated and combined the user and review data, ran linear regression on various combinations of features, and checked how they performed when predicting the business rating stars. The data was split into train and test sets; from each set, the feature columns were given as the input and the stars column as the target. We used Python's scikit-learn package (sklearn) to perform linear regression. We tried different splits of test and training data and achieved similar results; a sketch of this pipeline is shown below.
A few feature combinations that we used are: all user attributes; cool, funny, useful votes of reviews; average stars given by users; cool, funny, useful votes received by users; and so on. The MSE values for a few of the feature sets are as follows:
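A minimal sketch of one such regression run (the feature matrix here is a random stand-in for the real feature columns; current scikit-learn keeps train_test_split in model_selection):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(0)
    features = rng.rand(1000, 4)           # stand-in for the chosen feature columns
    stars = rng.randint(1, 6, size=1000)   # stand-in for the stars column

    X_train, X_test, y_train, y_test = train_test_split(
        features, stars, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
    print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))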

Table 4: Mean Squared Error Values

Feature Set                       train MSE    test MSE
all user features                 1.262155615  1.289070039
cool, funny, useful review votes  1.719629405  1.724435433
avg star user                     1.328749457  1.330678437
cool, funny, useful user votes    1.814434709  1.837739733

The mean squared error function computes the mean squared error, a risk metric corresponding to the expected value of the squared (quadratic) error or loss. If ŷ_i is the predicted value of the i-th sample and y_i is the corresponding true value, then the mean squared error (MSE) estimated over n_samples is defined as

MSE(y, ŷ) = (1 / n_samples) Σ_{i=0}^{n_samples - 1} (y_i - ŷ_i)^2

In short, the lower the MSE value, the better our model and the closer the fit is to the data. Without any feature analysis, the raw features give MSE values that are very high (> 1). Hence, we decided to investigate the features further and perform prediction on only those features that highly impact the business ratings.

7. FEATURE ANALYSIS

In order to find the features which best predict the business rating, we tried to find the correlation between different features of a business and the aggregated rating the business has received. We selected and extracted the following features and computed their correlation with the business rating:
i) average rating of all the users who have reviewed a particular business so far, ii) review count, iii) useful votes, iv) cool votes, v) funny votes.

Figure 7: Correlation Matrix Visualization 2


From the above correlation plot, we could deduce that the average rating of users is highly correlated with business stars compared to the other features. So we initially took the average rating of users as the primary feature and tried to predict the business rating; later we added other features and analyzed whether the MSE was reduced. A sketch of the correlation check follows.
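A minimal sketch of this check with pandas, assuming one row per business with the candidate feature columns (the numbers are toy values):

    import pandas as pd

    per_business = pd.DataFrame({
        "avg_user_rating": [4.1, 3.2, 4.5, 2.8, 3.9],
        "review_count":    [120, 45, 300, 12, 80],
        "useful_votes":    [80, 10, 210, 4, 60],
        "stars":           [4.0, 3.0, 4.5, 2.5, 4.0],
    })

    # Pairwise correlations; the "stars" column shows each feature's
    # correlation with the aggregated business rating.
    print(per_business.corr()["stars"].sort_values(ascending=False))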
We tried regression with many combinations of features alongside the user average rating, but only the category average (i.e., the average rating of the particular category such as Italian, Chinese, etc.), the review count, and the cool/funny/useful votes gave good results.
Table 5: Mean Squared Error Values

Feature Set                                       train MSE     test MSE
avg user rating                                   0.3311218195  0.343218655
avg user rating, category average                 0.5421097938  0.53110551
avg user rating, review count, cool/funny/useful  0.2941391645  0.3005636476
From the above table, it is clear that these features perform much better than the initial feature sets used before the correlation analysis. This model has an MSE as low as 0.3, which means it is a good fit for predicting business star ratings.
Table 6: Average of Predicted Rating

Actual Rating  Count  Avg of Predicted Rating
5              88866  4.727753151
4              78000  3.897875831
3              37588  3.511951464
2              24438  2.882491759
1              30886  2.437493715

Figure 6: Correlation Matrix Visualization 1

From the EDA we know that more than half of the reviews are positive, i.e., rated 4 or 5, and that the third highest count is for 3 stars.

The above table shows that the model does well on ratings between 3 and 5 but not so well on ratings 1 and 2. One reason could be that the number of 4 and 5 star reviews is more than twice the number of reviews with 1-3 stars, as we have seen in the EDA, and the model's predictions tend to deviate less than the real ratings.

8. GENERATING REVIEW TEXT FEATURES

There is still room for improvement, so we decided to use the review text as a feature, in addition to the above features, to predict business star ratings. Review text is one of the most important and rich components of the Yelp dataset. It can be processed and used in a variety of ways for a regression model; some of the most common methods are n-gram analysis, keyword association and extraction, and sentiment analysis.

Frequently Occurring Keywords


Initially, we took the approach followed by Bryan Hood, Victor Hwang and Jennifer King in their paper "Inferring Future Business Attention" [1]. The first method we tried was to mine the most frequently occurring keywords among all restaurant reviews. We then used counts of the sentiments for each of these keywords as features.
We use the algorithm below to produce a feature vector containing the counts of positive and negative words present in each review. This is a two-step process: first, we compute the top keywords among all the restaurant reviews; then, for each review, we count the number of times these keywords occur. We used the Python Natural Language Toolkit (NLTK) for this process.
Steps (a sketch follows the list):
- The first step is to clean the review text. This involves spell checking and removing the punctuation characters.
- Next, we tokenize each review into sentences and each sentence into words, then remove the tokens which are stopwords.
- Then we part-of-speech (POS) tag each token. The Penn Treebank corpus was used to tokenize and POS-tag the review text.
- From the tokens produced, we only care about the adjectives (labeled with tags starting with JJ). We ran the above steps over all the reviews of all the businesses coming under a specific category (i.e., restaurants).
- The 25 most frequently used words were extracted using this technique.
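A minimal sketch of these steps with NLTK (the two reviews are toy stand-ins; the tokenizer, stopword, and tagger models must be downloaded once):

    from collections import Counter
    import nltk
    from nltk.corpus import stopwords

    # One-time downloads: nltk.download("punkt"), nltk.download("stopwords"),
    # nltk.download("averaged_perceptron_tagger")
    reviews = ["The pizza was amazing and the staff were friendly.",
               "Terrible service, but the pasta was decent."]

    stop = set(stopwords.words("english"))
    adjective_counts = Counter()
    for review in reviews:
        for sentence in nltk.sent_tokenize(review.lower()):
            tokens = [t for t in nltk.word_tokenize(sentence)
                      if t.isalpha() and t not in stop]
            for word, tag in nltk.pos_tag(tokens):
                if tag.startswith("JJ"):       # keep only adjectives
                    adjective_counts[word] += 1

    top_keywords = [word for word, _ in adjective_counts.most_common(25)]
    print(top_keywords)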

The next step is to construct the feature vector for individual reviews by taking the counts of these most frequent words in them. We then aggregate the counts of each word by the business id, combine this feature vector with the average review stars for each business, and perform linear regression with the average review stars as the target.
Table 7: MSE for top-words feature vector

Feature Set          train MSE     test MSE
Top words frequency  0.4681823533  0.4656203843
This model performs worse than our initial feature vector, so we decided to try other models for review text feature analysis.

tf-idf
Term frequency alone only finds the frequency of a word in a document rather than the importance of the word. In order to compute a score based on importance, we chose the tf-idf method. tf-idf is short for term frequency-inverse document frequency.
tf-idf measures how important a word is to a document in a collection or corpus. It gives more weight to words that appear less frequently across all documents.
For this, we aggregated all reviews for a business and created a separate review document for each business, giving around 2500 business review documents. We then selected around 15 words and, using the Python NLTK library, computed a tf-idf score for each word specific to a business. Using these per-word tf-idf scores as features, we ran linear regression to predict business stars.
Table 8: MSE for tf-idf feature vector

Feature Set  train MSE      test MSE
tf-idf       0.07837244585  0.1047954418
This model produced excellent results, with an MSE as low as 0.1. Each run of 50,000 rows of reviews took around 3 hours, and so, due to the heavy nature of this algorithm, we were able to run it only on a subset of the review dataset: 50,000 reviews out of 1.5 million. This may be a reason why the model performs so well.
So we decided to run the model on an increasing number of rows to check how it performed with more data. After the initial run of about 50,000 rows, which contained information about 650+ restaurants, we ran the algorithm on about 150,000 reviews, covering more than 1,000 restaurants. Finally, we ran it on 400,000 reviews, covering more than 1,700 restaurants.

Table 9: MSE for tf-idf feature vector

#Reviews  train MSE      test MSE
50,000    0.07837244585  0.1047954418
150,000   0.05933798492  0.06423109019
400,000   0.06115728106  0.06005993051
Initially we chose the most frequently used words for the tf-idf scores. Since most reviews are positive, the chosen words were mostly positive, which hurt the accuracy of the rating prediction. So we split the reviews into two parts: the first containing highly rated reviews (4-5 stars) and the second containing low rated reviews (1-2 stars). We then chose the 10 most frequent words from the first part and the 5 most frequent words from the second part.
The reviews for each business were aggregated into a separate document and placed in a specific directory; we created around 2500 documents. To generate the tf-idf model we considered only 1500 documents, due to memory and processing constraints. We iterate through every file in that directory, converting the text to lowercase and removing punctuation. The reviews are then tokenized and preprocessed using NLTK's tokenizer and Porter stemmer, and passed to the TfidfVectorizer method, which generates a matrix of tf-idf vectors.
After generating this tf-idf score matrix, for each individual review we found the tf-idf scores of the chosen 15 words. We took the average tf-idf score for each business, treated the tf-idf scores of these 15 words as the feature vector, and passed it to a linear regression model to predict the business star rating. A sketch of this pipeline is shown below.
It is clearly evident from Table 9 and the low MSE that this model is a very good predictor of the star ratings of a restaurant. Even though we had initially thought that the low MSE values were due to the small number of rows taken as a sample, the MSE did not increase but rather decreased slightly as the number of input rows/reviews grew.
Table 10: Average of Predicted Rating

Actual Rating  Count   Avg of Predicted Rating
5              116911  4.882645761
4              106044  4.145951575
3              65632   3.082441265
2              52482   2.172148637
1              58930   1.343798834
Comparing Tables 6 and 10, it is evident that the average predicted rating under tf-idf is closer to the actual rating than what was obtained from the other features. Some difference remains, mostly because of the skewed distribution of the number of reviews received for each rating from 1 to 5. From Tables 9 and 10, tf-idf seems to be the best model for predicting the ratings.

Latent Dirichlet Allocation (LDA)
After experimenting with frequent words and tf-idf, we decided to test topic modeling and check whether it had any impact on the accuracy of business rating prediction. To accomplish this, we studied topic modeling algorithms and finally chose LDA over Latent Semantic Indexing (LSI): even though LSI is faster than LDA, the latter is more accurate.
LDA represents documents as mixtures of topics that emit words with certain probabilities. The business review texts are filtered using NLTK stop words and then tokenized into words. The words are then converted to a dictionary which contains word-id mappings, where each word in the corpus is assigned a unique id. As gensim uses the Bag of Words (BoW) representation, the dictionary word-id mappings are then converted to BoW format. The function doc2bow() in gensim simply counts the number of occurrences of each distinct word and returns the result as a sparse vector (word ids and their counts).
Instead of using raw word frequencies, we planned to incorporate tf-idf before passing the corpus to the LDA model. The BoW sparse vectors are transformed to tf-idf vectors using gensim's tf-idf model, which expects BoW vectors as input and returns vectors of the same dimensionality with tf-idf valued weights. These tf-idf valued vectors are given to the LDA model to generate the required number of topics.
The number of topics to be generated can be specified in the gensim LDA model; we chose 10 topics. Using this model, we took each review from the dataframe and computed how strongly it was related to each topic generated by the LDA model. We computed this score for around 150,000 reviews and took the average score for each business id. The ten topics generated by LDA are represented as feature vectors, and using these feature vectors we tried to predict the business rating with linear regression. We used the Python gensim package to perform topic modeling with LDA; a sketch follows.
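A minimal sketch of this gensim flow (the tokenized reviews are toy stand-ins):

    from gensim import corpora, models

    # Stand-ins for tokenized, stopword-filtered review texts.
    texts = [["pizza", "crust", "cheese", "delicious"],
             ["service", "wait", "staff", "rude"]]

    dictionary = corpora.Dictionary(texts)               # word -> id mappings
    bow_corpus = [dictionary.doc2bow(text) for text in texts]

    tfidf = models.TfidfModel(bow_corpus)                # expects BoW vectors
    lda = models.LdaModel(tfidf[bow_corpus], id2word=dictionary, num_topics=10)

    # Topic-probability vector for one review, usable as a 10-dim feature vector.
    print(lda.get_document_topics(bow_corpus[0], minimum_probability=0.0))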
Table 11: LDA MSE

#Businesses  2916
Train MSE    0.380210916
Test MSE     0.393376747
Though this model performs better than the frequent words model, it is only on par with the business features model and performs worse than the tf-idf model.

9. TIME SERIES ANALYSIS

The AR (autoregressive) model specifies that the output variable depends linearly on its own previous values and on a stochastic term. A moving-average (MA) model is conceptually a linear regression of the current value of the series against current and previous (unobserved) white noise error terms or random shocks. The random shocks at each point are assumed to be mutually independent and to come from the same distribution, typically a normal distribution with location at zero and constant scale. An ARMA model is a combination of the two. We decided to try the AR model first because we lacked a deeper understanding of the MA model.
For time series analysis, we extracted the ratings of the top 3 most reviewed businesses and their respective dates separately. The records were sorted by the date column in ascending order, so the dataset for each of the 3 businesses had only two columns: the date, in ascending order, and the corresponding rating received. We used the statsmodels package in Python for AR fitting and prediction.
We initially tried to fit the entire data with the AR model and predict the ratings for the whole period. We also tried splitting the first 70-80% of the data into training and using the rest as test data. The split had to preserve the original order because the output variable depends linearly on its own previous values. The actual and predicted ratings were then passed to the mean squared error function to measure accuracy; a sketch is shown below.
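A minimal sketch of this AR fit (current statsmodels exposes the AR model as AutoReg; the ratings array is a toy stand-in sorted by date):

    import numpy as np
    from sklearn.metrics import mean_squared_error
    from statsmodels.tsa.ar_model import AutoReg

    # Toy stand-in for one business's star ratings sorted by review date.
    ratings = np.array([4, 5, 3, 4, 4, 2, 5, 4, 3, 5, 4, 4], dtype=float)
    split = int(len(ratings) * 0.8)          # chronological train/test split
    train, test = ratings[:split], ratings[split:]

    model = AutoReg(train, lags=2).fit()
    predictions = model.predict(start=split, end=len(ratings) - 1)
    print("test MSE:", mean_squared_error(test, predictions))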
Table 12: AR MSE

Business 1  0.4306942204
Business 2  0.4868993014
Business 3  0.4729743146

As we can infer from the table, this model gives a reasonably high MSE compared to the other models. We suspect that the main reason is that the change in the ratings of a business over time depends on a variety of factors related to the business and its surroundings, not just its previous ratings. This may be why models that include other features of a business perform better than this one. Due to time constraints, we were unable to proceed further with time series analysis. We expect TSA would perform better if we incorporated time dependent features instead of just the previous ratings.

10. CROSS-FOLD VALIDATION

We validated the models that gave us the best results using cross-fold validation.
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that just repeated the labels of the samples it had seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting.
A solution to this problem is a procedure called cross-validation (CV). A test set should still be held out for final evaluation, but a separate validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets:


- a model is trained using k-1 of the folds as training data;
- the resulting model is validated on the remaining part of the data.
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
We used the cross_validation package from scikit-learn and its cross_val_score method to calculate the score. We called this method with four arguments: the estimator object used to fit the data, the data itself, the target values, and cv, where cv is a cross-validation generator or an iterable, i.e., cv determines the cross-validation splitting strategy by specifying the number of folds. A sketch follows.
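A minimal sketch of the scoring call (the old sklearn.cross_validation module now lives in model_selection; the data is a random stand-in):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X = rng.rand(200, 3)                                      # stand-in features
    y = X @ np.array([1.0, 0.5, -0.2]) + rng.normal(scale=0.1, size=200)

    scores = cross_val_score(LinearRegression(), X, y, cv=5)  # default R^2 score
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))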
Table 13: Cross Fold Score

model                                             cross fold score
avg user rating                                   Accuracy: 0.37 (+/- 0.09)
avg user rating, review count, cool/funny/useful  Accuracy: 0.36 (+/- 0.11)
most freq words                                   Accuracy: 0.08 (+/- 0.13)
tf-idf (653 businesses)                           Accuracy: 0.59 (+/- 0.13)
tf-idf (1327 businesses)                          Accuracy: 0.67 (+/- 0.15)
tf-idf (1777 businesses)                          Accuracy: 0.71 (+/- 0.10)

The accuracy is the mean of the returned scores variable, i.e., scores.mean(), and the +/- value is scores.std() * 2. We tried this for various values of cv and did not observe much change. As expected from our analysis so far, tf-idf performs better than the other models, and its accuracy seems to increase as more reviews are analysed.

11. CONCLUSION

The motivation for this project was to come up with a good method to predict a business's star rating from its features. This has many potential applications, such as determining the features that contribute to a good review, detecting fraudulent reviews, and, given a business and its features, suggesting what could be added to improve its ratings.
In this report, we have discussed our approach, which involved the generation of various correlated features for linear regression prediction, review text feature generation, and time series analysis. Several feature extraction methods, such as a term frequency classifier, Latent Dirichlet Allocation (LDA), and TF-IDF, have been used and evaluated.
Average user rating, review count, and cool/funny/useful votes were the features that best predicted the business rating. Among the review text based predictions, the tf-idf model produced better results than LDA and term frequency.

12. FUTURE WORK

In our project, we generated business features and review text features separately and evaluated them, but we did not test a combined feature matrix including both. We could combine the business features and the review text feature matrix and test how the combined set predicts the ratings of a business.
Also, due to time constraints, we could not carry out the time series analysis completely. The analysis performed with the AR model did not give very good results. We could improve it by incorporating time dependent features instead of just the previous ratings, e.g., time based features such as total reviews so far and cool/funny/useful votes so far. These could be used in the time series analysis, along with the ratings received so far, to predict how the ratings of a business would change in the future.

13. REFERENCES

[1] Bryan Hood, Victor Hwang, and Jennifer King. Inferring Future Business Attention. 2013. Retrieved from http://www.yelp.com/dataset_challenge.
[2] Wael Farhan. Predicting Yelp Restaurant Reviews. UCSD.
[3] Yinshi Zhang. Semantic Feature Analysis and Mining for Yelp Rating Prediction. UCSD.
[4] Rakesh Chada and Chetan Naik. Data Mining Yelp Data - Predicting Rating Stars from Review Text. Stony Brook University.
[5] J. Huang, S. Rogers, and E. Joo. Improving Restaurants by Extracting Subtopics from Yelp Reviews. 2014.
[6] gensim tutorial, Topics and Transformations: https://radimrehurek.com/gensim/tut2.html
