
Amazon Fine Foods Reviews

About Data: The Amazon fine foods reviews data has around 5 lakh (500,000) reviews, where each review has 8 features:
1. Product Id
2. User Id
3. User name
4. Helpfulness Ratio
5. Score
6. Time
7. Summary
8. Text

The data can be downloaded from this link. It spans about 10 years. There are
2 outliers in the data where the numerator of the helpfulness ratio is greater
than the denominator, which cannot happen; those are removed (a sketch of that
check is below). There are no missing values.
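
As a minimal sketch of that check (assuming the helpfulness field is stored as
a "numerator/denominator" string, as in the CSV built in the next section;
paths are shortened here):

import pandas as pd

# Sketch: find and drop rows whose helpfulness numerator exceeds its
# denominator ("foods2.txt" is the CSV produced in the next section).
adda = pd.read_csv("foods2.txt", encoding="ISO-8859-1")
num = adda["help"].str.split("/").str[0].astype(int)
den = adda["help"].str.split("/").str[1].astype(int)
print(adda[num > den])      # the 2 impossible outliers
adda = adda[num <= den]     # remove them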

Preparing CSV data from Raw Data: We just want to convert the data into
CSV format, and the code for that is provided below.
import re

# Convert the raw dump (one "key: value" line per field, with a blank
# line between reviews) into a comma-separated file.
dude = open("G:\\main 3 datasets\\finefoods.txt\\foods1.txt", "w")
count = 0
with open("G:\\main 3 datasets\\finefoods.txt\\foods.txt") as adda:
    for i in adda:
        if i != "\n":
            x = i.split(": ", 1)          # split only on the first ": "
            y = x[1][:-1]                 # drop the trailing newline
            y = re.sub("[,]", " ", y)     # commas in values would break the CSV
            if count != 8:
                dude.write(y)
                if count != 7:            # no comma after the last field
                    dude.write(",")
            elif count == 8:              # first field of the next review
                dude.write("\n")
                dude.write(y)
                dude.write(",")
                count = 0
            count = count + 1
dude.close()
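
For reference, each record in the raw file is a block of "key: value" lines,
which is why the script splits on ": " and counts 8 fields per review. The
field names below are from the SNAP dataset description; the values are
illustrative placeholders:

product/productId: B001ABCDEF
review/userId: A1XYZ0000000
review/profileName: some user
review/helpfulness: 1/1
review/score: 5.0
review/time: 1303862400
review/summary: Good quality product
review/text: Sample review text goes here.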

Words that tend to positive and negative reviews:

1) We have taken scores of 4 and 5 as positive and 1 and 2 as negative, leaving out 3.
2) We have used NLTK tools in Python to tokenize, remove stop words,
and stem similar words down to a common root.
3) We have written every row of the text column of the data as a bunch of
words or elements, which can be used as rows (transactions) for association rules.
4) There are many words common to both types of reviews, but we keep
only those that support a single type of review (a sketch of this filtering
step is shown after the snippet below).
5) We have taken only 1 lakh (100,000) rows of data.
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

adda1 = open("G:\\main 3 datasets\\finefoods.txt\\words_tend to negative reviews.txt", "w")
adda = pd.read_csv("G:\\main 3 datasets\\finefoods.txt\\foods2.txt",
                   encoding="ISO-8859-1", usecols=["text", "score"],
                   nrows=100000)
tokenizer = RegexpTokenizer(r'\w+')        # build these once, not per row
en_stop = set(stopwords.words("english"))
p_stemmer = PorterStemmer()
for i in range(len(adda)):
    if adda["score"][i] in [1.0, 2.0]:     # negative reviews only
        raw = str(adda["text"][i]).lower()
        y = tokenizer.tokenize(raw)
        z = [t for t in y if t not in en_stop]   # drop stop words
        w = [p_stemmer.stem(t) for t in z]       # stem similar words together
        for t in set(w):                   # one transaction per review
            adda1.write(t)
            adda1.write(" ")
        adda1.write("\n")
adda1.close()
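
The exclusivity filter of step 4 is not part of the snippet above. A minimal
sketch, assuming the same loop has also been run with scores [4.0, 5.0] to
write a positive-words file (the second file name below is an assumption):

# Keep only the words that support exactly one type of review.
def load_words(path):
    with open(path) as f:
        return set(word for line in f for word in line.split())

neg = load_words("words_tend to negative reviews.txt")
pos = load_words("words_tend to positive reviews.txt")   # assumed name

only_neg = neg - pos    # words that appear only in negative reviews
only_pos = pos - neg    # words that appear only in positive reviews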

The words that tend to positive and negative reviews can be accessed
from this link. Association rules in Spark can be learned from this link.
Predictive Model on Helpfulness:
1) We have taken the entire data, as taking only 1 lakh rows gives severe
overfitting.
2) The features include the score, the text length, and the number of reviews
a particular user has given.
3) The response feature is whether the review was helpful for at least 1
customer or not (binary classification).
4) The text length feature has some 2,300 different values and the user
feature has 87.
import pandas as pd
from collections import Counter

adda1 = open("G:\\main 3 datasets\\finefoods.txt\\foods3.txt", "w")
adda = pd.read_csv("G:\\main 3 datasets\\finefoods.txt\\foods2.txt",
                   encoding="ISO-8859-1",
                   usecols=["help", "user_id", "score", "text"],
                   nrows=500000)
adda["user_id"] = adda["user_id"].astype("category")
adda["user_id"] = adda["user_id"].cat.codes

# Reviews per user, precomputed once (a per-row list.count() would be
# quadratic over 500,000 rows).
user_counts = Counter(adda["user_id"])

for i in range(len(adda["score"])):
    x = int(adda["help"][i].split("/")[0])   # helpful votes
    y = int(adda["help"][i].split("/")[1])   # total votes
    # Label 1 if the review was helpful for at least 1 customer.
    if y != 0 and (x / y) * 100 > 0:
        adda1.write(str(1))
    else:
        adda1.write(str(0))
    adda1.write(",")
    adda1.write(str(int(adda["score"][i])))
    adda1.write(",")
    adda1.write(str(user_counts[adda["user_id"][i]]))
    adda1.write(",")
    adda1.write(str(len(str(adda["text"][i]))))
    adda1.write("\n")
adda1.close()

We have applied Decision Tree, Random Forest, and Gradient Boosting methods
on this data. The results are:
1) Decision Tree - 64.38%
2) Random Forest - 64.59% (200 trees, depth 37)
3) Gradient Boosting - 64.75% (200 trees, depth 17)
You can learn the above methods in this link.
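
As a sketch of how the three models can be fit with scikit-learn on the
generated foods3.txt (the column names, the 80/20 train/test split, and
random_state are assumptions; the tree counts and depths are the ones
reported above):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# foods3.txt has no header row; these names are assumed for readability.
data = pd.read_csv("foods3.txt",
                   names=["helpful", "score", "user_reviews", "text_len"])
X = data[["score", "user_reviews", "text_len"]]
y = data["helpful"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=200, max_depth=37),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=200, max_depth=17),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
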
Learning patterns in the data: Let's understand how the individual columns
behave.
Products Histogram: It shows how many times a particular product id is
reviewed.
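
A minimal sketch of such a histogram with pandas and matplotlib (the
product_id column name is an assumption, following the naming of the other
columns; paths shortened):

import pandas as pd
import matplotlib.pyplot as plt

adda = pd.read_csv("foods2.txt", encoding="ISO-8859-1", usecols=["product_id"])
# Reviews per product, then the distribution of those counts.
adda["product_id"].value_counts().hist(bins=50)
plt.xlabel("reviews per product")
plt.ylabel("number of products")
plt.show()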

Helpfulness Percentages: The helpfulness percentage is divided into 4 groups:
more than 75, between 50 and 75, between 0 and 50, and 0. Their frequencies
decrease in that order.
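
One way to form these groups, as a sketch (the pd.cut bin edges are an
assumption reconstructing the four groups described above):

import pandas as pd

adda = pd.read_csv("foods2.txt", encoding="ISO-8859-1", usecols=["help"])
num = adda["help"].str.split("/").str[0].astype(int)
den = adda["help"].str.split("/").str[1].astype(int)
pct = num / den.where(den != 0) * 100    # NaN where a review has no votes

# Four groups: exactly 0, (0, 50], (50, 75], (75, 100]
groups = pd.cut(pct, bins=[-0.1, 0, 50, 75, 100])
print(groups.value_counts().sort_index())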

Score Histogram: The frequency of 5-star ratings is higher than that of every
other rating combined. It shows that most of the time the customers are
satisfied with the product.

User Histogram: There are 70,398 different users. We have taken a partial
graph, as the number of users who reviewed more than 5 times is very small.

Relation between Columns:

Rating vs average text length: The length of the text increases from rating 1
to 4 but decreases at 5. The reason might be that a rating of 5 doesn't need
much explanation, while ratings 1 to 4 require the reviewer to explain why
they gave less than 5.
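
This relation is a simple groupby, as a sketch (column names follow the
earlier snippets; paths shortened):

import pandas as pd

adda = pd.read_csv("foods2.txt", encoding="ISO-8859-1", usecols=["score", "text"])
adda["text_len"] = adda["text"].astype(str).str.len()
# Average review length for each rating 1 to 5.
print(adda.groupby("score")["text_len"].mean())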

Ratings vs percentage helpfulness: It shows that higher ratings are more
helpful than lower ones.

But when raw helpful votes are taken instead of percentages, the graph looks
like this: helpful votes increase up to score 4 but decrease at 5. Customers
seem to find a review helpful only if the reviewer points out the weak points
of the product.

User Id vs Percentage Helpfulness: The users are divided into 4 groups: those
who did more than 50 reviews, 10 to 50, 1 to 10, and exactly 1. The graph
shows that those who did more reviews are the ones who get more helpful votes.

Reviewers vs average Scores: From the graph we can see that those who did
more reviews are the ones who give high scores. The users are divided in
the same way as in the previous graph.

Reviewers vs average text length: People who did more reviews write more
text. Users are divided in the same way as above.

Percentage helpfulness vs average text length: Helpfulness is divided into
more than 75, between 50 and 75, between 0 and 50, and 0.

Without taking average values of text length and instead taking all helpful
votes, the graph below is drawn.
