Rahul Venkataraman
rvenkatarama@cs.stonybrook.edu
Tamilmani Manoharan
tmanoharan@cs.stonybrook.edu
Venkatakrishnan Rajagopalan
verajagopala@cs.stonybrook.edu
110385545, 110368788, 110455765
ABSTRACT

Keywords
Yelp, Ratings, Prediction, Feature Extraction, Linear Regression, NLP
1. INTRODUCTION
Yelp, Inc. is a company that operates a social networking, user review, and local search web site of the same name. Over 31 million people access Yelp's website each month, putting it in the top 150 of U.S. Internet web sites. The company also trains small businesses in how to respond to reviews, hosts social events for reviewers, and provides data about businesses, including health inspection scores.
The goal of our project is to predict the ratings of various businesses given their features, reviews, and data about the users who have written the reviews. We have approached this problem in two different ways:
i) Predict the ratings of different businesses from their given aggregated features and the features we have extracted for each business by combining user and review data with the business data.
ii) Learn how the ratings of a particular business have been changing over time and predict how they will change in the future.
2. PRIOR WORK

3. DATA DESCRIPTION
The review table includes information about each review. Specifically, it contains business id, user id, stars (a star rating on a scale of 1-5), text (the raw review text), date, and votes (the number of useful, funny or cool votes). The user table consists of user id, name, review count, average stars (the average rating on a scale of 1-5 given by the user), and votes (the total number of votes for reviews made by this user). The business table contains details about each business, including id, name, neighborhoods, address and geographic information, stars, review count (the total number of reviews about this business), categories (a list of category tags for this business), and other attributes.
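The three tables join on user id and business id. A minimal pure-Python sketch of that join, using illustrative records shaped like the field descriptions above (the sample values are invented, not from the dataset):

```python
# Illustrative records shaped like the Yelp tables described above.
reviews = [
    {"business_id": "b1", "user_id": "u1", "stars": 4, "text": "Great food"},
    {"business_id": "b1", "user_id": "u2", "stars": 2, "text": "Too noisy"},
]
users = {
    "u1": {"name": "A", "average_stars": 3.8},
    "u2": {"name": "B", "average_stars": 2.9},
}
businesses = {"b1": {"name": "Cafe X", "stars": 3.5, "review_count": 120}}

# Attach the matching user and business record to each review.
joined = [
    {**r, "user": users[r["user_id"]], "business": businesses[r["business_id"]]}
    for r in reviews
]
print(joined[0]["business"]["name"])  # Cafe X
```

Real runs would do the same join over the full JSON dumps (e.g. with pandas merges), but the key structure is this one-to-one lookup by id.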
4.
Since there are a lot of businesses with very few reviews, we planned to work only on data that are heavily reviewed. In order to get rich data with sufficient attributes for predicting business ratings, EDA was performed to find the categories with the most reviews and to extract only those data that are heavily reviewed. As explained in the Problem Statement, any business with insufficient reviews will not be helpful in any analysis.
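The filtering step amounts to counting reviews per business and keeping only businesses above a threshold. A minimal sketch (the ids and the threshold are illustrative):

```python
from collections import Counter

# Hypothetical business_id column of the review table.
review_business_ids = ["b1", "b1", "b1", "b2", "b1", "b3", "b3"]
MIN_REVIEWS = 3  # illustrative cutoff

# Count reviews per business and keep only the heavily reviewed ones.
counts = Counter(review_business_ids)
heavily_reviewed = {b for b, n in counts.items() if n >= MIN_REVIEWS}
print(heavily_reviewed)  # {'b1'}
```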
5.

Table 3: Fit quality of Linear, Quadratic, 4th-order and 6th-order Polynomial models
6.

In this section, we investigate how each feature of users, businesses and reviews influences the rating stars of a business.

Figure 4: 2nd highest reviewed business

Linear regression models the target Y as a linear function of the feature variables X_i, with a bias term β and a regularization term λ:

    Y = β + Σ_i (w_i * X_i) + λ

The models are evaluated using the mean squared error:

    MSE(y, ŷ) = (1 / n_samples) * Σ_{i=0}^{n_samples - 1} (y_i - ŷ_i)^2
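As a library-free sketch, the single-feature case of this model can be fit in closed form and scored with the MSE defined above (the data points are illustrative):

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = b0 + w*x, the single-feature case of the model above."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - w * mx, w  # bias, weight

def mse(ys, preds):
    """Mean squared error as defined above."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]  # roughly y = 2x
b0, w = fit_simple_linear(xs, ys)
preds = [b0 + w * x for x in xs]
print(round(w, 2), round(mse(ys, preds), 4))  # w ≈ 1.99, MSE near zero
```

The multi-feature version we actually ran uses the same least-squares objective, just solved over a full feature matrix.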
7. FEATURE ANALYSIS
In order to find the features which best predict the business rating, we tried to find the correlation between different features of a business and the aggregated rating the business has received. We selected and extracted the following features and tried to find their correlation with the business rating:
i) Average rating of all the users who reviewed a particular business so far
ii) Review count
iii) Useful votes
iv) Cool votes
v) Funny votes
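The correlation between any one of these features and the business rating can be sketched with a hand-rolled Pearson correlation (the per-business values below are invented for illustration):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between a feature and the rating."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-business values: average rating of its reviewers vs. its rating.
avg_user_rating = [3.2, 3.8, 4.1, 2.5, 4.6]
business_rating = [3.0, 4.0, 4.0, 2.5, 4.5]
print(round(pearson(avg_user_rating, business_rating), 3))
```

A coefficient near 1 (as in this toy example) indicates a strongly predictive feature; values near 0 suggest the feature adds little.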
8.
The next step is to construct the feature vector for individual reviews, taking the counts of these most frequent words in them. We then aggregate the word counts by the business id, combine the resulting feature vector with the average review stars for each business, and perform linear regression with the average review stars as the target feature.
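The construction above can be sketched in a few lines (the review texts are illustrative stand-ins for the dataset):

```python
from collections import Counter

reviews = [  # hypothetical (business_id, review text) pairs
    ("b1", "great food great service"),
    ("b1", "food was cold"),
    ("b2", "great place"),
]

# 1. Find the most frequent words across all review text.
all_words = Counter(w for _, text in reviews for w in text.split())
top_words = [w for w, _ in all_words.most_common(3)]

# 2. Count the top words per review, aggregated per business_id.
features = {}
for biz, text in reviews:
    counts = Counter(text.split())
    vec = [counts[w] for w in top_words]
    agg = features.setdefault(biz, [0] * len(top_words))
    features[biz] = [a + v for a, v in zip(agg, vec)]

print(top_words, features["b1"])
```

Each business row in `features` is then paired with that business's average star rating for the regression.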
Table 7: MSE for Top words feature vector
Feature Set         | train MSE    | test MSE
Top words frequency | 0.4681823533 | 0.4656203843
This model performs worse than our initial feature vector. Hence we decided to try other models for review text feature analysis.
tf-idf

Term frequency just measures how often a word occurs in a document, not how important the word is. In order to compute a score based on importance, we chose the tf-idf method. tf-idf is short for term frequency-inverse document frequency. tf-idf estimates how important a word is to a document in a collection or corpus: it gives more weight to words that appear less frequently across all documents. For this, we aggregated all reviews for a business and created a separate document per business, giving around 2500 business review documents. We then selected around 15 words and, using the Python NLTK library, computed the tf-idf score of each word specific to a business. Using these tf-idf scores as features, we used linear regression to predict the business star ratings.
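We computed the scores with NLTK; as a library-free sketch, the score itself reduces to a term frequency times an inverse document frequency (the documents below are toy stand-ins for the per-business review documents):

```python
import math

# One aggregated review document per business (hypothetical text).
docs = {
    "b1": "great food great service".split(),
    "b2": "terrible food slow service".split(),
    "b3": "great ambience".split(),
}

def tfidf(word, doc, docs):
    """tf-idf: term frequency in the document times inverse document frequency."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs.values() if word in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# "great" appears in most documents, so it is down-weighted relative to
# the rarer word "terrible" even though it occurs more often.
print(tfidf("great", docs["b1"], docs), tfidf("terrible", docs["b2"], docs))
```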
Table 8: MSE for tf-idf feature vector
Feature Set | train MSE     | test MSE
tf-idf      | 0.07837244585 | 0.1047954418
This model produced excellent results, with MSE as low as 0.1. Each run of 50,000 rows of reviews took around 3 hours, so due to the heavy nature of this algorithm we were able to run it only on a subset of the review dataset: 50,000 reviews out of 1.5 million. This may be one reason why the model performs so well.
So we decided to run the model on an increased number of rows to check how it performed with more data. After the initial run of about 50,000 rows, which contained information about 650+ restaurants, we ran the algorithm on about 150,000 reviews, covering more than 1,000 restaurants. Finally we ran the algorithm on 400,000 reviews, covering more than 1,700 restaurants.
Latent Dirichlet Allocation (LDA)
After experimenting with frequent words and tf-idf, we decided to test topic modeling and check whether it has any impact on predicting the business rating. In order to accomplish this, we studied topic modeling algorithms and finally chose LDA over Latent Semantic Indexing (LSI): even though LSI is faster than LDA, the latter is more accurate. LDA represents documents as mixtures of topics that emit words with certain probabilities. The business review texts are filtered using NLTK stop words and then tokenized. The words are then converted to a dictionary which contains word-id mappings; each word in the corpus is assigned a unique id. As gensim uses the Bag of Words (BoW) representation, the documents are then converted to BoW format. The function doc2bow() in gensim simply counts the number of occurrences of each distinct word, maps each word to its id, and returns the result as a sparse vector of (word id, count) pairs.
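The dictionary and doc2bow steps can be sketched in pure Python (this mirrors what gensim's Dictionary and doc2bow do; the token lists are illustrative):

```python
from collections import Counter

# Illustrative tokenized review documents.
tokenized_reviews = [
    ["great", "food", "great", "service"],
    ["food", "cold"],
]

# Dictionary: assign each distinct word a unique id, in order of first appearance.
word2id = {}
for doc in tokenized_reviews:
    for w in doc:
        word2id.setdefault(w, len(word2id))

def doc2bow(doc):
    """Sparse (word_id, count) vector, like gensim's doc2bow."""
    counts = Counter(word2id[w] for w in doc)
    return sorted(counts.items())

print(doc2bow(tokenized_reviews[0]))  # [(0, 2), (1, 1), (2, 1)]
```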
Instead of using raw word frequencies, we incorporated tf-idf before passing the vectors to the LDA model. The BoW sparse vectors are transformed to tf-idf vectors using the gensim tf-idf model, which expects BoW vectors as input and returns vectors of the same dimensionality with tf-idf weights. These tf-idf weighted vectors are given to the LDA model to generate the required number of topics.
The number of topics to be generated can be specified in the gensim LDA model; we chose 10 topics. Then, using this model, we took each review from the dataframe and computed how strongly it was related to each topic generated by the LDA model. We computed this score for around 150,000 reviews and took the average score for each business id. The ten topics generated by LDA are represented as feature vectors, and using these feature vectors we tried to predict the business rating with linear regression. We used the Python gensim package to perform topic modeling with LDA.
Table 11: LDA MSE (2916 businesses)
Train MSE   | Test MSE
0.380210916 | 0.393376747
Though this model performs better than the frequent words model and is on par with the business features model, it performs worse than the tf-idf model.
9.

The AR (autoregressive) model specifies that the output variable depends linearly on its own previous values and on a stochastic term. A moving-average (MA) model is conceptually a linear regression of the current value of the series against current and previous (unobserved) error terms.

Table 12: AR MSE
Order | MSE
1     | 0.4306942204
2     | 0.4868993014
3     | 0.4729743146
As we can infer from the table, this model gives a reasonably high MSE compared to the other models. We suspect the main reason is that the change in the ratings of a business over time depends on a variety of factors related to the business and its surroundings, not just its previous ratings. This may be why models that include other features of a business perform better. Due to time constraints, we were unable to proceed further with Time Series Analysis. We hope TSA would perform better if we incorporated time-dependent features instead of just the previous ratings.
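The order-1 case of the AR model can be sketched as an ordinary least-squares fit of each value against its predecessor (the rating series below is invented for illustration):

```python
def fit_ar1(series):
    """Least-squares AR(1): y_t ≈ c + phi * y_{t-1}."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    phi = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - phi * mx, phi  # intercept, lag coefficient

# Hypothetical monthly average ratings of one business.
ratings = [3.0, 3.2, 3.4, 3.3, 3.6, 3.7, 3.8]
c, phi = fit_ar1(ratings)
next_rating = c + phi * ratings[-1]  # one-step-ahead forecast
print(round(next_rating, 2))
```

Higher-order AR(p) fits regress on the last p values the same way; our runs used orders 1 through 3.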
10. CROSS-FOLD VALIDATION
The accuracy is the mean of the returned scores, i.e. scores.mean(), and the +/- value is scores.std() * 2. We tried various values of cv and did not observe much change. As expected from our analysis so far, tf-idf performs better than the other models, and accuracy seems to increase as more reviews are analysed.
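A minimal sketch of the k-fold procedure behind these numbers, using a trivial stand-in model (predicting the training mean) and an invented rating series:

```python
def cross_val_mse(ys, k):
    """For each fold: fit on the other folds, score MSE on the held-out fold.
    The 'model' here is just the training mean, as a stand-in."""
    fold = len(ys) // k
    scores = []
    for i in range(k):
        test = ys[i * fold:(i + 1) * fold]
        train = ys[:i * fold] + ys[(i + 1) * fold:]
        pred = sum(train) / len(train)  # trivial model: the training mean
        scores.append(sum((y - pred) ** 2 for y in test) / len(test))
    return scores

ys = [3.0, 3.5, 4.0, 3.2, 3.8, 4.1, 2.9, 3.6, 3.3, 3.9]
scores = cross_val_mse(ys, 5)
mean = sum(scores) / len(scores)
std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
print(f"{mean:.2f} (+/- {std * 2:.2f})")  # reported as mean +/- 2*std
```

In our actual runs the stand-in model is replaced by the regression under test, with scores coming from scikit-learn-style cross-validation.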
11. CONCLUSION
12. FUTURE WORK
In our project, we generated business features and review text features separately and evaluated them, but we did not test any combined feature matrix including both. We could combine the business features with the review text feature matrix and test how the combined set of features predicts the ratings of a business. Also, due to time constraints, we could not carry out Time Series Analysis completely, and the analysis performed with the AR model did not give very good results. We could improve this by incorporating time-dependent features instead of just the previous ratings: for example, time-based features such as the total reviews so far and the cool, funny and useful votes so far. These could be used in Time Series Analysis to predict how the ratings of a business would change in the future, given the time-dependent features generated from the ratings received so far.
13. REFERENCES