
Yelp Business Study

Rima Gibbings, Sepideh Esmaeilpour, Mohammad Ghasemisharif,


Niloufar Dousti Mousavi, Neshat Mohammadi
December 9, 2016

Introduction

In modern society, people are busy and look for information services that give them a competitive advantage. The best of these services offer precise information with fast response times, and they enable businesses to estimate profitability and predict rankings. Yelp is an example of this kind of service, using customer feedback to create reliable rankings. We used Yelp's data to build our own business ranking prediction system. We considered many machine learning approaches, including Support Vector Machines, Bagging, Neural Networks, Random Forest, and logistic regression as our prominent choices. What drew us toward logistic regression and Random Forest was the ability to examine the weights of different features and their impact on users' inclination to rely on comparison websites such as Yelp. As a team, we aimed to base our research decisions on improving previously performed studies. To this end, one of our main challenges was to identify best practices from studies similar to ours. We established several aspects of similarity: dataset size and feature scale, connectivity between input and output variables, optimal feature selection, and meaningful outcome interpretation.
Finding previous studies that had similar goals and used a large number of data items was one of our main objectives. Machine learning concepts are relatively new, and incorporating large datasets that are not organized for study purposes introduces issues such as data cleansing, resolving missing data, and identifying meaningful output labels. To eliminate the main issues raised in the initial study discussions, we cleaned our data by eliminating dataset files that did not directly influence our outcome variables. Features that were logically unrelated to a user's selection of a given business were removed in the first phase of feature elimination. Subsequently, we conducted several scenario discussions to identify meaningful features that would add predictive value. During our team meetings and extended discussions, we realized that the breadth of a given variable, and the extent of its influence, depend on different factors. In the case of Yelp businesses, we realized that the label representation of a data item was not a guaranteed attestation of its actual position in the data, and that the number of reviews assigned to each rating needed to be investigated (Table 1). Our team collaboration brought together a positive diversity of discussion methods. We prioritized our study questions based on business rating significance and adopted a team-based approach to verifying study methods to achieve optimal accuracy.

Rate  Reviews  Price  Noise    Attire  Alcohol   WiFi  Ages Allowed  Smoking  BYOB/Corkage  # True Attributes
4.5      58      2    average  casual  full bar  free  None          outdoor  None          13
4.0     158      2    average  casual  full bar  no    None          None     None          13
4.0      94      2    loud     casual  none      free  None          None     yes free      13
4.0      24      2    quiet    casual  full bar  free  None          no       None          13
4.0       4      2    quiet    casual  full bar  free  None          outdoor  None          13
3.5     256      2    average  casual  full bar  no    None          no       no            13
3.5     174      2    average  casual  full bar  free  None          outdoor  None          13
3.5     150      2    average  casual  full bar  free  None          None     None          13
3.5      99      2    quiet    casual  full bar  free  None          no       None          13
3.5      35      2    average  casual  full bar  free  None          outdoor  None          13
3.5      18      2    average  casual  full bar  free  None          outdoor  None          13
3.5      10      1    average  casual  full bar  free  None          outdoor  None          13
3.0      53      2    average  casual  full bar  free  allages       outdoor  None          13
3.0      25      1    quiet    casual  full bar  free  None          outdoor  None          13

Table 1: To capture a meaningful correlation between the number of true attributes, review counts, and ratings, we separated a handful of businesses with the maximum number of true attributes from the provided dataset. A brief investigation of the extracted table reveals that good ratings are not guaranteed for businesses with a greater number of true attributes. It can also be concluded that similarity between two businesses in terms of review counts and true-attribute counts is not an indicator of equivalent ratings.

Figure 1: While rating 4 contains the most-reviewed businesses (right), it does not contain the highest number of businesses. Selecting 4.5 as a threshold resulted in binary classes with a drastic difference in the number of subject items (restaurants), so to obtain a fairer split for our outcome labels we selected 4 as the threshold point.

Background

As a research study, we had doubts about relying on Yelp reviews, which can contain both biased and fake/subjective content. To address this, we expanded our research to include prior studies that examine the efficacy of social media reviews and determine their reliability and trustworthiness. In reviewing the available literature, we encountered comparisons between movie reviews and restaurant reviews. The intent of the reviewer in each case is very different, which subsequently affects review content. It has been noted that ratings in the movie and book industries rely on well-established organizations and websites, which balances the online review process; moreover, the deliberate nature of book reviewing, with its greater time and effort, was contrasted with a restaurant visit, leading to the conclusion that these two types of reviewers are distinctively different. In addition to the review-reliability issue, we also discovered that, given the real challenges facing privately-owned restaurants, a rating increase or decrease can directly affect their life span and income level (Luca, 2011) [3]. These issues sharpened our focus on comparing the importance of reviews across different machine learning methods and discussing the most appropriate tool along with its potential limitations.
Another topic we investigated was the effect of Yelp reviews on chain restaurants. According to recent studies, the reach and, potentially, the popularity of chain restaurants have been decreasing due to Yelp usage [3]. According to the Yelp website, 42% of adults aged 18-34 consider Yelp reviews in their decision-making process, and 61% of college-educated adults also review the ratings prior to their selection [2]. These findings led us to contemplate the feature selection process. As mentioned in the study, we systematically reviewed descriptive features, such as the attributes, and transformed the data into measurable and effective features. In completing this process, we considered all the decisive factors in chain and privately-owned restaurants and selected the common factors. Future work could certainly elaborate on this topic and determine (as recommended in our study) whether, for a given algorithm, certain features play a more prominent role in labeling the study subjects.
We also considered the continuity of the factors that determine good ratings on social media in general and Yelp in particular. We discussed the possibility that restaurants reduce their investment in the factors that contributed to their high ratings once the desired rating has been obtained [2]. A product based on our research would mitigate this, since it would continuously re-evaluate the labels according to up-to-date data. However, one limitation that could become challenging is building an algorithm that transfers the descriptive data provided in the JSON files into an established feature format. As raw data is gathered from Yelp files, new and undefined vocabulary might not exist in the feature categorization we created, prompting us to build a new labeling/classification system that constructs features based on vocabulary characteristics.

Methodology

While evaluating the Yelp dataset, our initial intuition was to define a mapping function that takes the feature values of a given sample and assigns it a numerical scalar value. This value would be our scale for discriminating between samples in different categories. In addition, by digging into the real-world value of the provided features, we concluded that each feature needs to be weighted by its significance. The above steps were performed to emphasize the relevance of the data examination process in objective studies. Our study also affirmed the value of applying a set of cleaning, labeling, and standard classification methods to our processes.

Design

Due to the nature of our dataset, we decided to use one-hot encoding, since the dimensions of the features in the Yelp dataset varied significantly. In order to use linear regression, we treated the feature with the largest dimension as our reference and converted all of the remaining features to the same dimension. In the following section we provide further details about the conversion of categorical variables into dummy variables using Python packages such as pandas; a minimal sketch follows below.
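As an illustration of the encoding step, here is a minimal sketch with pandas; the column names (NoiseLevel, WiFi) are placeholders rather than the exact Yelp fields:

```python
import pandas as pd

# Hypothetical slice of the business data: two categorical attributes.
df = pd.DataFrame({
    "NoiseLevel": ["average", "loud", "quiet"],
    "WiFi": ["free", "no", "free"],
})

# pandas.get_dummies expands each categorical column into one
# binary (0/1) indicator column per observed category.
encoded = pd.get_dummies(df, columns=["NoiseLevel", "WiFi"])
print(encoded)
# Columns: NoiseLevel_average, NoiseLevel_loud, NoiseLevel_quiet,
#          WiFi_free, WiFi_no
```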


Figure 2: We ultimately settled on 27 features. As shown, review count has a significant impact on the random forest's feature selection (left); however, the scale difference can affect the feature importance. The right plot depicts the coefficients of our logistic regression model with binary labels.

4.1 Data examination phase

We proceeded by evaluating all potential correlations between distinctive features and outcome labels. A primitive approach that gave us an intuition of the weights of different features was the linear regression method; a sketch is given below.
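A minimal sketch of this weight-intuition step with scikit-learn, using placeholder data in place of our encoded Yelp features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data: rows are businesses, columns are encoded features.
rng = np.random.default_rng(0)
X = rng.random((1000, 27))          # e.g. the 27 features we settled on
y = rng.uniform(1.0, 5.0, 1000)     # star ratings in [1, 5]

model = LinearRegression().fit(X, y)

# The magnitude of each coefficient gives a rough sense of how strongly
# a feature moves the predicted rating (valid only on comparable scales).
for idx in np.argsort(np.abs(model.coef_))[::-1][:5]:
    print(f"feature {idx}: coef = {model.coef_[idx]:+.3f}")
```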

4.2 Classification phase

Considering the results obtained in the data examination phase, we chose logistic regression as our early classification approach to discover the relation between features and output labels. After analyzing the results, and due to the low coefficient values for a considerable number of raw features, we determined that utilizing approaches that could improve our feature selection process, and subsequently produce higher accuracy, was a compelling priority. Random Forest was one of the more popular methods we investigated for identifying impactful features. Two main reasons we chose this algorithm were that it is competitive in terms of accuracy with other algorithms and that it has proven efficient on large databases. Furthermore, it produces a relatively reliable initial set of features. However, our features contained large variations in their scales of measurement, and such features have previously been shown to take on unjustifiable significance in outcome prediction [4]. To resolve this imbalance, we proceeded with a one-on-one comparison of the entire set of modified features in both the Random Forest and logistic regression models (Figure 2). The comparison was aimed at identifying the optimal set of features relevant to the Yelp outcome. By employing these two methods, we were able to clean the data into a format that yielded more dependable feature values, which positively impacted the classification methods.

Figure 3: In this experiment we originally took the 9-label approach, then the 5-label approach, and finally the binary-label approach for outcome classification, since the earlier accuracies were not reliable. Next we modified our features: starting from a large set (470 features), we removed some predictors, some categorizations, and other features that we judged less relevant. Moreover, we observed a large gap between the 0 and 1 labels, caused by an imbalanced number of samples; by changing the threshold and balancing the number of samples, we achieved 60% precision.

Moreover, obtaining accuracy values for various multi-class and binary-class tasks paved the way for many other approaches that we used accordingly. In fact, these two approaches built the initial foundation for the input feature space and defined our binary class labels for the rest of the project. We arranged the rest of the experiments applying Nearest Neighbors, Linear SVC, Decision Tree, Neural Network, AdaBoost, Naive Bayes, and QDA techniques. Gathering all the results, we were able to compare the performance measures of the different approaches (Table 3); a sketch of the feature-importance comparison follows.
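A sketch of the one-on-one comparison between the two models' notions of feature relevance, again with placeholder data; standardizing before logistic regression is one way to soften the scale issue noted above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.random((2000, 27))                       # placeholder encoded features
y = (rng.random(2000) > 0.5).astype(int)         # placeholder binary labels
cols = [f"feat_{i}" for i in range(X.shape[1])]

# Random forest impurity-based feature importances.
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Logistic regression coefficients on standardized inputs, so that
# coefficient magnitudes are comparable across features [4].
lr = LogisticRegression(max_iter=1000).fit(StandardScaler().fit_transform(X), y)

for name, imp, coef in zip(cols, rf.feature_importances_, lr.coef_[0]):
    print(f"{name:>8}: RF importance={imp:.3f}  LR coef={coef:+.3f}")
```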
Experiment  Average Precision  Average Recall  Average F1-score  Support data
#1          0.22               0.25            0.19              85901
#2          0.22               0.25            0.19              85901
#3          0.22               0.25            0.19              85901
#4          0.21               0.25            0.21              34503
#5          0.44               0.47            0.44              34503
#6          0.59               0.68            0.58              85901
#7          0.58               0.69            0.57              85901
#8          0.70               0.81            0.74              34503
#9          0.60               0.61            0.57              34503

Table 2: Logistic regression results over 9 experiments. Experiments 1-4, 5, and 6-9 use 9, 5, and 2 (binary) labels, respectively. As also reflected in Figure 3, binary labels with a threshold of 4.0 give fair and better precision.

Implementation

As one of the popular languages for research, Python offers a rich ecosystem of libraries dedicated to statistics and machine learning. In this study, we use various Python packages such as pandas, NumPy, matplotlib, and scikit-learn for data manipulation, scientific computing, plotting, and machine learning, respectively.

Figure 4: The number of experiments we conducted in total to reach a satisfactory precision level. At the beginning of the study we considered 9 outcome labels; to improve the testing process we proceeded with feature modification and label reduction. Moreover, we observed a large gap between the 0 and 1 labels, caused by an imbalanced number of samples; by changing the threshold and balancing the number of samples, we achieved 60% precision.

We clean the data in multiple phases to prepare it for each evaluation method. The
Yelp dataset package was provided in several JSON files, and each file contained a list of dictionaries. The dictionaries included field descriptions and attribute interpretations that offered helpful insight into feature formation. Attributes are either true/false or have categorical values (e.g., Noise Level is high, average, or low). As an initial step, we considered the Stars attribute to be the outcome variable of the study. Each record was allocated a Stars field based on the average rating given by Yelp users/customers for a given business. Since the reviewer's choice is tied to the two most decisive attributes, the number of Stars and the total number of reviews, it is not implausible to treat the attributes as features and Stars as labels. Since the labels fall within the range [0, 5], we followed different labeling approaches for the predictive outcome variable and ran the experiments with each option. After each iteration, we fully analyzed the results and took further action to either exclude or include the appropriate features in the next iteration.
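As a sketch of this loading step, the snippet below reads the line-delimited business file and splits Stars off as the label; the file and field names (yelp_academic_dataset_business.json, stars) follow the common Yelp dataset layout and should be treated as assumptions:

```python
import json
import pandas as pd

# Each line of the business file is one JSON dictionary.
records = []
with open("yelp_academic_dataset_business.json") as f:
    for line in f:
        records.append(json.loads(line))

df = pd.DataFrame(records)

# "stars" (the average user rating) becomes the outcome variable;
# the remaining fields are candidates for features.
y = df["stars"]
X = df.drop(columns=["stars"])
print(df.shape, y.describe())
```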

5.1 Data cleaning and observations

In our initial analysis, we examined all the features available in the Business dataset to identify missing data based on incomplete features and to determine the most appropriate replacements for the missing values; we considered adding a false value for such features. In studying the dataset features, we encountered the attributes field, which contains a varying set of words analogous to business characteristics. This feature prompted us to associate the presence of these words with numerical values and ultimately assess their total effect on the predictive outcome. To validate our approach with the attributes feature, we investigated the correlation between the number of true attributes and the ratings (labels). We observed a maximum of 13 true attributes, occurring in 14 records, and as shown in Table 1 there is no correlation between ratings and the number of true attributes. We added the new feature to our dataset as a new column containing the total number of true attributes. To study the effect of the newly added feature, we divided the dataset into individual rating groups and sorted the first 10 subject items in each group by review count and number of available business-related images. These groups were individually considered to evaluate the relationship between the new feature and the output labels. Although there is a small number of true attributes among very low ratings (1.0 and 1.5), there seems to be no distinct
Figure 5: We selected 10 items from our testing data to check whether our system's predictions match the actual restaurant ratings; in 7 of the 10 cases we predicted accurately, an improvement over our model accuracy of 60 percent.

correlation for the remaining groups of ratings. Building on our initial naive approach of evaluating the
attribute feature, we improved our study method to present the attribute values in a more meaningful form.
Utilizing available dictionaries in each attribute field, we transferred each value to a new feature in order to
assess its true impact on the output label. Furthermore, we investigated the relationship between the photo
counts and review counts for a given business. Based on a bar plot that compares these two features (Figure 6), we
observed a direct correspondence between them. In other words, highly reviewed businesses are the ones
with more images submitted by patrons.
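A sketch of how the true-attribute count feature can be computed, assuming each business's attributes field has been parsed into a Python dictionary (the attribute names shown are illustrative):

```python
import pandas as pd

# Illustrative rows: each business carries a dict of attribute flags.
df = pd.DataFrame({
    "name": ["A", "B"],
    "attributes": [
        {"Takes Reservations": True, "Delivery": False, "Outdoor Seating": True},
        {"Takes Reservations": True, "Delivery": True},
    ],
})

def count_true(attrs):
    """Count attribute values that are literally True."""
    if not isinstance(attrs, dict):
        return 0
    return sum(1 for v in attrs.values() if v is True)

df["n_true_attributes"] = df["attributes"].apply(count_true)
print(df[["name", "n_true_attributes"]])
```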
After analyzing our features, we decided to assess the impact and efficiency of the labeling classification. To this end, we attempted to find the best classification through a trial-and-error process. The labeling was completed in three separate phases: (I) nine labels, L ∈ {1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0}; (II) five labels, L ∈ {1, 2, 3, 4, 5}; and (III) binary labels, L ∈ {0, 1}, where L equals 1 if the rating is equal to or greater than a specified threshold and 0 otherwise (Table 2, Figure 3).
Since our dataset had a descriptive nature, it contained a variety of variable types: numeric values, categorical values, and binary (true/false) values. In this phase we determined that it would serve our study goal better to convert the categorical variables into dummy variables and represent their availability with an n_records × n_categorical_features matrix. We used the pandas.get_dummies function for this purpose; a sketch of the labeling schemes follows. Finally, we excluded features that are not representable or were irrelevant to our study, such as full address, name, neighborhoods, etc.
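A minimal sketch of the three labeling schemes, assuming stars holds the average ratings; deriving the 5-label grouping by rounding is an assumption for illustration, and the threshold shown matches our final choice of 4.0:

```python
import pandas as pd

stars = pd.Series([3.5, 4.0, 4.5, 2.0, 5.0])  # placeholder ratings

# (I) nine labels: the raw half-star ratings themselves.
labels9 = stars

# (II) five labels: ratings grouped to whole stars (here, by rounding).
labels5 = stars.round().astype(int)

# (III) binary labels: 1 if rating >= threshold, else 0.
THRESHOLD = 4.0
labels2 = (stars >= THRESHOLD).astype(int)

print(pd.DataFrame({"stars": stars, "L9": labels9, "L5": labels5, "L2": labels2}))
```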

Results

The initial experiments with logistic regression were arranged as follows. A range of modified, combined, or eliminated features, as well as various label definitions, were examined to find the optimal setting in terms of accuracy. For instance, the categories feature consisted of a huge number of items that varied completely even for similar businesses. For simplification, we decided to consider merely the length of this feature in experiments 1 and 2; the feature was eliminated in the experiments that used only the restaurant instances. In addition, we restricted the number of features to 27
in experiments 7, 8, and 9, which was accomplished by eliminating the dummy city variables.

Figure 6: As illustrated, there is a proportional correspondence between review counts and photo counts for all businesses, which can be interpreted in two ways. The first assumption is that a percentage of reviewers have also uploaded photos of their experience. The second assumption is that business owners who manually upload photos tend to attract more customers/reviewers.
The plot in Figure 4 shows the obtained accuracies for each of the methods in our experiment. The proportion of testing data to training data was altered in 4 different ways to investigate the effect of the amount of training data on each algorithm. Comparing the settings we considered for decision trees and random forests, we concluded that at lower depths decision trees are more accurate than the random forest algorithm; however, as we increased the depth to 15 with 100 trees, Random Forest predictions improved. The optimal result for Random Forest was an accuracy of 62%, which is independent of the number of trees for T ≥ 100. The Neural Network was trained with various numbers of hidden layers (depth and breadth), but the best accuracy, 62%, was achieved with hidden layers of sizes (30, 30, 30). Nearest Neighbors accuracy with K = 5 was about 57% and did not improve further as the number of neighbors increased. We obtained 58% accuracy with linear SVM. Naive Bayes yielded 47% accuracy, and the lowest accuracy, 44%, was obtained by QDA.
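The classifier suite behind these numbers can be sketched with scikit-learn as follows; the hyperparameters mirror the settings reported above (depth 15 with 100 trees, (30, 30, 30) hidden layers, K = 5), while X and y stand for our encoded features and binary labels. This is a reconstruction, not our exact script:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

classifiers = {
    "Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Linear SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(max_depth=15, n_estimators=100),
    "Neural Net": MLPClassifier(hidden_layer_sizes=(30, 30, 30), max_iter=500),
    "AdaBoost": AdaBoostClassifier(),
    "Naive Bayes": GaussianNB(),
    "QDA": QuadraticDiscriminantAnalysis(),
}

def run_all(X, y):
    # 60/40 train/test split, matching Table 3.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)
    for name, clf in classifiers.items():
        clf.fit(X_tr, y_tr)
        print(f"{name:>18}: accuracy = {clf.score(X_te, y_te):.4f}")
```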

Related Work

As discussed in [1], the Yelp dataset was used to create an SVM-based ranking. The features explored in that project included users' food preferences and dietary restrictions, cuisine type, services, ambiance, noise level, average rating, etc. The accuracy achieved in that study was 72%. The project also proposed a metric that maximizes minimum happiness, reflecting real-world social situations. Such group-recommendation methods provide an alternative to systems in which the restaurant suggested to each individual user model is an aggregate across the same group.

Future Work

As we discussed regarding review count impact, for future work it is important to consider the finite nature of our data and the review count imbalance present within our rating groups. To counter the effect of review counts in certain rating groups, we discussed several solutions; restricting the comparison to a certain duration of time, or adjusting for the review count prior to comparison, were some of the options discussed.

Classifier         Accuracy
Nearest Neighbors  0.5717
Linear SVM         0.5777
Decision Tree      0.6041
Random Forest      0.6291
Neural Net         0.6293
AdaBoost           0.6230
Naive Bayes        0.4704
QDA                0.4402

Table 3: Final results on the restaurants dataset with binary labels and a threshold of 4. The train/test ratio is 60/40.
A second topic that seemed critical for the future work of our project is using a more balanced set of features by normalizing them, since in certain study methods higher-density features demonstrate unjustified significance. Given the descriptive nature of the words present in the different features, we recommend assigning appropriate weights to differentiate between the impact of more influential features and less effective ones. Finally, we recommend building an algorithm that can classify the subject items prior to the feature study, the idea being to consider the appropriate features for given restaurant examples.

Conclusion

The behavior of feature-selection algorithms is very complicated, and performance depends strongly on the classification rules, the feature-label distribution, and the sample size. One algorithm may outperform another for a particular distribution or sample size, yet be significantly outperformed on a different distribution, or even on the same distribution with a different sample size. Perhaps most importantly, in small-sample settings, especially in the presence of high dimensionality, there is often little correlation between the errors for the selected and best feature sets. Owing to its importance in contemporary high-throughput datasets such as Yelp's, there needs to be a serious effort to understand feature selection. We suggest that readers who want to continue this work focus more on Yelp's elite members.

References
[1] A Preference-Based Restaurant Recommendation System for Individuals and Groups. https://www.cs.cornell.edu/rahmtin/files/yelpclassproject.pdf. Accessed: 2016-12-09.
[2] P. Hajas, L. Gutierrez, and M. S. Krishnamoorthy. Analysis of Yelp reviews. arXiv preprint arXiv:1407.1443, 2014.
[3] M. Luca. Reviews, reputation, and revenue: The case of Yelp.com. Harvard Business School NOM Unit Working Paper (12-016), September 16, 2011.
[4] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1):1, 2007.

