
Sentiment Analysis on Twitter Data for Brand Perception: A review

Mrs. Shivani Agarwal


M.Tech. Research Scholar, RGPV University, Indore, MP
Agarwalshivani82@gmail.com

Abstract: Opinions are key influencers of our behaviour. When choosing among hundreds of brands, individuals seek opinions from family, friends or relatives, while organisations rely on surveys, opinion polls and consultants to learn about opinions that can damage or enhance their reputation. Hence it is imperative to process this information to understand brand perception, and Twitter sentiment analysis is one way to do so. Sentiment analysis, also known as opinion mining, is a field of study that analyses people’s opinions, sentiments, evaluations, attitudes and emotions on the basis of polarity (positive, negative or neutral). It applies machine learning techniques to derive high-quality information from text, here to mine customer perception of a particular brand. The goal of this paper is to provide a comprehensive study of machine learning techniques such as Naïve Bayes, Support Vector Machines and KNN (k-Nearest Neighbors) for extracting sentiments automatically. Various Python packages and frameworks support sentiment analysis.

KEYWORDS: Twitter Data, Sentiment Analysis, Machine learning, Python programming

1. INTRODUCTION

Opinions are subjective expressions that describe people’s sentiments, appraisals or feelings toward entities, events and their properties. The concept of opinion is very broad. In this paper, we focus only on opinion expressions that convey people’s positive or negative sentiments. With the growing usage of the Web, people can post their opinions or reviews about any brand on Internet forums, discussion groups or blogs, which are collectively known as user-generated content. When purchasing a product, people are no longer limited to the opinions of family or friends, and organizations no longer depend on survey polls or external consultants for product reviews, because such information can easily be obtained from user-generated content. Twitter is one such popular microblogging service, where users create status messages (called tweets). Tweets are used to express public and private opinions about any product, service or topic relating to users’ daily lives. Opinions expressed as tweets are limited to 140 characters. However, because of the large number of diverse sources and the huge volume of text, it is a formidable task for a human reader to extract the relevant information. Thus automated opinion discovery or summarization systems are needed. Sentiment analysis is a field of study that grows out of this need. Tweets can be classified as positive, negative or neutral using an automatic sentiment extractor. [1][2]

Using such an analyzer, businesses and organizations can benchmark products and services and use the results as part of market intelligence. It enables individuals to shortlist brands and find public opinions, and it helps gather critical feedback about problems in newly released products. Twitter moods are also used to predict product sales. Some Twitter statistics illustrate how Twitter helps the marketing of a brand: [4]

• 500 million Tweets are sent each day, which is roughly 6,000 Tweets every second.
• 65.8% of US companies with 100+ employees use Twitter for marketing.
• 80% of Twitter users have mentioned a brand in a Tweet.
• 76% of consumers are likely to recommend the brand following friendly service.
• People are 31% more likely to recall what they saw on Twitter.

NCERTCSE-2019-02
2. RELATED WORK

Sentiment analysis systems have found applications in almost every business and social domain. The book in [1] gives an in-depth introduction to the problem. It discusses document-level, sentence-level and aspect-based sentiment classification and shows that sentiment analysis is a multi-faceted problem with many challenging subproblems, together with the existing techniques for dealing with them. It then discusses sentiment lexicon generation, with its two dominant approaches, and the analysis of comparative and superlative sentences, which represent a different type of evaluation from the regular opinions that have been the focus of current research. It also introduces opinion search or retrieval and, finally, opinion spam detection and the assessment of review quality. To ensure the trustworthiness of such opinions, combating opinion spamming is an urgent and critical task.

The authors of [3] first present a detailed procedure for carrying out sentiment analysis to classify highly unstructured Twitter data into positive or negative categories. They then discuss various techniques for sentiment analysis on Twitter data, including knowledge-based techniques and machine learning techniques such as Naïve Bayes, Maximum Entropy, Support Vector Machine and Random Forest.

The article in [4] focuses on building a Twitter sentiment analyzer. It aims to extract tweets about a particular topic from Twitter (recency = 1-7 days) and to analyze the opinion of tweeples (people who use twitter.com) on this topic as positive, negative or neutral, and it explains how to build such an analyzer.

The work in [6] presents an approach to aspect-level sentiment classification for Twitter. Tweets are collected using the Twitter API, pre-processed, and POS-tagged using the R programming language. The approach helps to understand the general sentiment around a movie and which aspects people liked or disliked, and it gives insight into how opinions change over time.

The work in [7] finds that social media such as Twitter can be used to predict people's sentiments by combining the KNN and Naïve Bayes algorithms. The proposed system determines the sentiment of tweets extracted from Twitter using its API.

The article in [8] gives a real-world example of text classification. It trains a machine learning model capable of predicting whether a given movie review is positive or negative. This is a classic example of sentiment analysis, in which people's sentiments towards a particular entity are classified into different categories.

The article in [10] covers three ways of encoding categorical features: LabelEncoder and OneHotEncoder, DictVectorizer, and Pandas get_dummies. The article in [11] covers another widely used classification technique, K-nearest neighbors (KNN). It focuses primarily on how the algorithm works and how the input parameter affects the output/prediction.

The authors of [14] investigate a novel approach to estimating attribute-specific brand perceptions from social media, providing a low-cost, real-time alternative to traditional elicitation methods. The analysis focuses on the social media platform Twitter and provides a fully automated and highly generalizable method. Extant data mining approaches in the marketing literature require context-specific manual tuning and/or data annotation, which can be as costly and time-consuming as, or more so than, the manual direct-elicitation methods they aim to replace.

The article in [16], on mobile reviews, proposes a method that combines Naïve Bayes, KNN and modified k-means clustering and reports that it is more accurate than the Naïve Bayes and KNN techniques individually. It obtained an overall classification accuracy of 91% on a test set of 500 mobile reviews. The authors report that it is much faster than other machine learning algorithms such as Naïve Bayes classification or Support Vector Machines, which take a long time to converge to the optimal set of weights, and that its accuracy is comparable to that of the current state-of-the-art algorithms used for sentiment classification on mobile reviews.

The survey in [15] discusses the overfitting problem and proposes a solution using SVM classifiers. The overfitting problem cannot be eliminated entirely, but it can be reduced. The proposed solution reduces the time cost and increases the accuracy and performance of the system.

3. METHODOLOGY FOR SENTIMENT ANALYSIS

Machine learning for sentiment analysis follows three phases: [3-15]

Phase 1 - Training phase: The training data is used to train the model by pairing each given input with the expected output.

Phase 2 - Validation phase: This phase measures the goodness of the trained learning model and estimates model properties such as error measures, sensitivity, specificity, precision, recall and others. It uses a validation dataset, and its output is a more sophisticated learning model.

Phase 3 - Application phase: In this phase the model is applied to real-world data for which results need to be derived.

Fig 1 shows how learning is applied to build a predictive model.

Fig 1: Phases for performing ML

3.1 Data Collection:

Data forms the main source of learning in machine learning. Data can be retrieved using the Twitter API; it can be in any format, can be received at any frequency and can be of any size. The Twitter API is a widely used interface for reading and writing Twitter data. There are different types of Twitter data, such as profile data and tweet messages. The former is static while the latter is dynamic. Tweets can be text, images, videos, URLs or spam tweets. Data in the machine learning context can be either labeled or unlabeled. [17]

Unlabeled data: It consists of samples of natural or human-created artifacts that we can easily obtain from the world. Some examples of unlabeled data include photos, audio recordings, videos, news articles, tweets etc.

Labeled data: It takes a set of unlabeled data and augments each piece of it with some sort of meaningful “tag”, “label” or “class” that is somehow informative or desirable to know.

Fig 2: Labeled and Unlabeled data

3.2 Pre-processing of Data:

The collected data is raw, and it must be pre-processed (cleaned) before it is applied to the classifier. The Natural Language Toolkit (NLTK) along with Python (3.x) will be used. The pre-processing procedure involves the following steps: [3]

• Convert the tweets to lowercase.
• Remove Twitter notations such as hashtags (#), retweets (RT), account ids (@) and punctuation (!).
• Remove URLs, hyperlinks and emoticons (remove the non-letter data and symbols).
• Remove stop words such as is, are, and, in etc., as they don’t signify any emotion.
• Compress elongated words, such as "bessst" to "best".
• Expand slang words such as g8 (great) and f9 (fine), as they may carry an extreme level of sentiment.
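The pre-processing steps listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: it uses only the standard `re` module so that it runs without downloads, and the stop-word set, the slang table and the function name `preprocess_tweet` are assumptions made for the example. A fuller pipeline would take its stop words from NLTK, as the section plans.

```python
import re

# Illustrative assumptions: a tiny stop-word set and slang table.
# A real pipeline would use a fuller list, e.g. NLTK's English stop words.
STOP_WORDS = {"is", "are", "and", "in", "the", "a", "an", "to", "of", "it"}
SLANG = {"g8": "great", "f9": "fine"}


def preprocess_tweet(tweet):
    """Return the cleaned tokens of one tweet."""
    text = tweet.lower()                                 # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"\brt\b", " ", text)                  # remove retweet marker
    text = re.sub(r"@\w+", " ", text)                    # remove account ids
    text = text.replace("#", " ")                        # drop '#', keep the word

    tokens = []
    for token in text.split():
        token = re.sub(r"[^a-z0-9]", "", token)     # strip punctuation and symbols
        token = SLANG.get(token, token)             # expand slang such as g8
        token = re.sub(r"[^a-z]", "", token)        # keep letters only
        # Collapse runs of three or more identical letters to one letter.
        # This is a simplification: it would also turn "coool" into "col".
        token = re.sub(r"(.)\1{2,}", r"\1", token)
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens


if __name__ == "__main__":
    print(preprocess_tweet("RT @brand: The new phone is g8!!! Loooove it #happy http://t.co/abc"))
```

Slang is expanded before digits are stripped, because tokens such as g8 would otherwise lose the character that identifies them.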

To train the classifiers, we need to fetch tweets that are already classified. Pre-processing makes it easier to extract information from the text and to apply the machine learning algorithms.

3.3 Feature Extraction:

Python’s Scikit-learn library will be used to extract features, in a format supported by machine learning algorithms, from datasets consisting of formats such as text and image. [9-10]

3.3.1 The Bag of Words representation covers the most common ways to extract numerical features from text content, namely:

• tokenizing strings and giving an integer id to each possible token;
• counting the occurrences of tokens in each document;
• normalizing and weighting, with diminishing importance, tokens that occur in the majority of samples / documents.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus. Vectorization is the general process of turning a collection of text documents into numerical feature vectors.

3.3.2 The Pandas get_dummies method is a straightforward one-step procedure for obtaining dummy variables for categorical features. Its advantage is that it can be applied directly to a data frame: it recognizes the categorical features and performs the get_dummies operation on them.

3.4 Classification:

Machine learning techniques are classified into supervised and unsupervised techniques. To carry out sentiment analysis on Twitter data, supervised machine learning techniques, which depend heavily on already labeled training data, will be used. Supervised learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output: [3]

Y = f(X) (1)

Supervised learning problems can be further grouped into regression and classification problems.

Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” or “no disease”.

Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.

Supervised Machine Learning Algorithms:

Machine learning methods such as Naïve Bayes, K-Nearest Neighbor and Support Vector Machines (SVM) can be used to build classification models automatically from the data in a training set.

3.4.1 Naïve Bayes:

The Naïve Bayes classifier uses Bayes’ theorem. It assumes that all features are independent of each other. This simplifying assumption, known as the Naive Bayes assumption, makes it much easier to combine the contributions of the different features, since we don't need to worry about how they should interact with one another. [5]

With P(label) as the prior probability, we can calculate an expression for P(label|features), the probability that an input will have a particular label given that it has a particular set of features:

P(label|features) = P(label) * P(f1|label) * ... * P(fn|label) / P(features) (2)

3.4.2 KNN Classification:

K-Nearest Neighbor is a simple algorithm that stores all available cases and predicts the target of a new case based on a similarity measure.

Fig 3: K-Nearest Neighbor

A KNN model can be implemented by following the steps below: [13]

The “K” in the KNN algorithm is the number of nearest neighbors whose votes we wish to take. [11]

• Load the data.
• Initialize the value of k.
• To get the predicted class, iterate from 1 to the total number of training data points.
• Calculate the distance between the test data and each row of training data. Here we use Euclidean distance as the distance metric:

  $d(a, b) = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2}$ (3)

  where a = (a_1, ..., a_N) is the test point and b = (b_1, ..., b_N) is a row of the training data.
• Sort the calculated distances in ascending order.
• Get the top k rows from the sorted array.
• Get the most frequent class of these rows.
• Return the predicted class.

3.4.3 Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges. The goal of SVM is to find the hyperplane that separates the two classes. [12]

Fig 4: Possible hyperplanes

• Multiple hyperplanes can be used to classify the data.
• Using PLA (Perceptron Learning Algorithm), start with a random hyperplane and use it to classify the data.
• Pick a misclassified example, select another hyperplane by updating the hyperplane's parameters, and classify the data again.
• Repeat the steps until no misclassified examples remain.

SVM has a technique called the kernel trick. Kernels are functions that take a low-dimensional input space and transform it into a higher-dimensional space, i.e. they convert a non-separable problem into a separable one. The technique is mostly useful in non-linear separation problems. Simply put, it performs some extremely complex data transformations and then finds a way to separate the data based on the labels or outputs you have defined. When we look at the hyperplane in the original input space, it looks like a circle:

Fig 5: Possible hyperplane

3.5 Evaluation Parameters

In a classification problem, we can represent errors using a “confusion matrix”. [3]

Table 1: Confusion matrix

                    Predicted Positive     Predicted Negative
Actual Positive     True Positive (TP)     False Negative (FN)
Actual Negative     False Positive (FP)    True Negative (TN)
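As an illustration of Table 1, the sketch below counts the four cells of the confusion matrix for a binary sentiment task and derives the four indicators used in this paper (accuracy, precision, recall, specificity) from them. The function names, the label strings and the sample label lists are assumptions made for the example, and the zero-denominator guard is a design choice of the sketch, not something the paper specifies.

```python
def confusion_counts(actual, predicted, positive="positive"):
    """Count TP, FN, FP and TN for parallel lists of actual and predicted labels."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # actual positive, predicted positive
        elif a == positive:
            fn += 1          # actual positive, predicted negative
        elif p == positive:
            fp += 1          # actual negative, predicted positive
        else:
            tn += 1          # actual negative, predicted negative
    return tp, fn, fp, tn


def _ratio(numerator, denominator):
    # Return 0.0 instead of raising when a denominator is empty.
    return numerator / denominator if denominator else 0.0


def indicators(tp, fn, fp, tn):
    """Derive the four indicators from the confusion-matrix counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


if __name__ == "__main__":
    # Hypothetical labels for ten tweets.
    actual = ["positive"] * 4 + ["negative"] * 6
    predicted = ["positive", "positive", "positive", "negative",
                 "positive", "positive", "negative", "negative",
                 "negative", "negative"]
    counts = confusion_counts(actual, predicted)
    print(counts)
    print(indicators(*counts))
```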

The performance of sentiment classification techniques is estimated using four indicators, as follows:

Accuracy => (TP+TN)/(TP+FP+FN+TN) (4)
Precision => TP/(TP+FP) (5)
Recall => TP/(TP+FN) (6)
Specificity => TN/(TN+FP) (7)

Accuracy is the percentage of predictions that were correct. Recall is the percentage of actual positive cases that were caught. Precision is the percentage of positive predictions that were correct. Specificity is the percentage of actual negatives that were correctly identified.

4. Conclusion:

In this paper we have discussed the techniques that can be used for Twitter data analysis for brand perception. Once the data is fetched through the Twitter API, it is cleaned and pre-processed using NLTK, and features are extracted using Scikit-learn, which makes the data readable and executable for the machine learning classifiers. The labeled data is then used as training data and passed through the classifiers to train the models, which are evaluated using indicators such as accuracy, precision, recall and specificity.

References

[1] Bing Liu, "Sentiment Analysis and Opinion Mining", Morgan & Claypool Publishers, May 2012.
[2] Bing Liu, "Sentiment Analysis and Subjectivity", invited chapter for the Handbook of Natural Language Processing, Second Edition, March 2010.
[3] Mitali Desai, Mayuri A. Mehta, "Techniques for Sentiment Analysis of Twitter Data: A Comprehensive Survey", in International Conference on Computing, Communication and Automation (ICCCA). DOI: 10.1109/CCAA.2016.7813707
[4] Ravikiran Janardhana, "How to build a twitter sentiment analyzer?", May 8, 2012. [Online]. Available: https://ravikiranj.net/posts/2012/code/how-build-twitter-sentiment-analyzer/#id10
[5] "Learning to Classify Text". [Online]. Available: http://www.nltk.org/book/ch06.html
[6] Kiruthika M., Priyanka Giri, Sanjana Woonna, "Sentiment Analysis of Twitter Data", in International Journal of Innovations in Engineering and Technology (IJIET), Volume 6, Issue 4, April 2016.
[7] Shubham Goyal, "Review Paper on Sentiment Analysis of Twitter Data Using Text Mining and Hybrid Classification Approach", International Journal of Engineering Development and Research, Volume 5, Issue 2, 2017.
[8] Usman Malik, "Text Classification with Python and Scikit-Learn", August 27, 2018. [Online]. Available: https://stackabuse.com/text-classification-with-python-and-scikit-learn/
[9] "sklearn.feature_extraction.DictVectorizer". [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer
[10] Yang Liu, "Encoding Categorical Features", Sep 13, 2018. [Online]. Available: https://towardsdatascience.com/encoding-categorical-features-21a2651a065c
[11] Tavish Srivastava, "Introduction to k-Nearest Neighbors: Simplified", March 26, 2018. [Online]. Available: https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/
[12] Sunil Ray, "Understanding Support Vector Machine algorithm from examples (along with code)", September 13, 2017. [Online]. Available: https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
[13] Shubham Panchal, "k Nearest Neighbor Classifier ( kNN )-Machine Learning Algorithms", Mar 9, 2018. [Online]. Available: https://medium.com/@equipintelligence/k-nearest-neighbor-classifier-knn-machine-learning-algorithms-ed62feb86582
[14] Aron Culotta, Jennifer Cutler, "Mining Brand Perceptions from Twitter Social Networks", Marketing Science, Articles in Advance, 22 Feb 2016.
[15] Anjali Mahavar, Priya Pati, Abhishek Tripathi, "A Survey Paper on Twitter Sentiment Analysis of Current Affairs", in International Journal for Scientific Research & Development, Vol. 4, Issue 10, 2016.
[16] Onam Bharti, Mrs. Monika Malhotra, "SENTIMENT ANALYSIS ON TWITTER DATA", in International Journal of Computer Science and Mobile Computing, Vol. 5, Issue 6, June 2016.
