Abstract: Opinions are key influencers of our behaviour. When choosing among hundreds of brands, individuals seek
opinions from family, friends or relatives, while organisations rely on surveys, opinion polls and consultants; these
opinions can damage or enhance a brand's reputation. Hence it is imperative to process this information to understand
brand perception, here using Twitter sentiment analysis. Sentiment analysis, also known as opinion mining, is a field
of study that analyses people's opinions, sentiments, evaluations, attitudes and emotions on the basis of polarity
(positive, negative and neutral). It draws on machine learning techniques to derive high-quality information from
text, in this case to mine customer perception of a particular brand. The goal of this paper is to provide a
comprehensive study of machine learning techniques such as Naïve Bayes, Support Vector Machines and KNN (k-Nearest
Neighbors) for extracting sentiments automatically. Various Python packages and frameworks support sentiment analysis.
NCERTCSE-2019-02
2. RELATED WORK

Sentiment analysis systems have found applications in almost every business and social domain. The book in [1] gives an in-depth introduction to this fascinating problem. It discusses document-level, sentence-level and aspect-based sentiment classification, shows that sentiment analysis is a multi-faceted problem with many challenging subproblems, and reviews the existing techniques for dealing with them. It then discusses the problem of sentiment lexicon generation, for which there are two dominant approaches, and the problem of analyzing comparative and superlative sentences. Such sentences represent a different type of evaluation from the regular opinions that have been the focus of current research. The book also introduces the topic of opinion search or retrieval and, finally, discusses opinion spam detection and assessing the quality of reviews. To ensure the trustworthiness of such opinions, combating opinion spamming is an urgent and critical task.

In [3], the authors first present a detailed procedure for carrying out sentiment analysis to classify highly unstructured Twitter data into positive or negative categories. They then discuss various techniques for sentiment analysis of Twitter data, including knowledge-based techniques and machine learning techniques such as Naïve Bayes, Maximum Entropy, Support Vector Machine and Random Forest. The article in [4] focuses on building a Twitter sentiment analyzer. It aims to extract tweets about a particular topic from Twitter (recency = 1-7 days) and to classify the opinion of tweeples (people who use twitter.com) on this topic as positive, negative or neutral, and it explains how to build such an analyzer.

In [6], an approach to aspect-level sentiment classification for Twitter is presented. Tweets are collected using the Twitter API, preprocessed, and POS-tagged using the R programming language. The approach helps to understand the general sentiment around a movie, to see which aspects people liked or disliked, and to gain insight into how opinions change over time. In [7], it is found that social media such as Twitter can be used to predict the sentiments of people with a combination of the KNN and Naïve Bayes algorithms. The proposed system determines the sentiments of tweets extracted from Twitter using its API.

In [8], a real-world example of text classification is given. It trains a machine learning model capable of predicting whether a given movie review is positive or negative. This is a classic example of sentiment analysis, where people's sentiments towards a particular entity are classified into different categories. The article in [10] covers three ways of encoding categorical features: LabelEncoder and OneHotEncoder, DictVectorizer, and Pandas get_dummies. The article in [11] discusses another widely used classification technique, K-nearest neighbors (KNN). It focuses primarily on how the algorithm works and how the input parameter affects the output/prediction.

In [14], a novel approach is investigated for estimating attribute-specific brand perceptions from social media, providing a low-cost, real-time alternative to traditional elicitation methods. The analysis focuses on the social media platform Twitter and provides a fully automated and highly generalizable method. Extant data mining approaches in the marketing literature require context-specific manual tuning and/or data annotation to implement, which can be as costly and time-consuming as, or more so than, the manual direct-elicitation methods they aim to replace.

The article in [16], on mobile reviews, proposes a method that combines Naïve Bayes, KNN and modified k-means clustering and finds it to be more accurate than the Naïve Bayes and KNN techniques individually. It obtained an overall classification accuracy of 91% on a test set of 500 mobile reviews. It is reported to be much faster than other machine learning algorithms such as Naïve Bayes classification or Support Vector Machines, which take a long time to converge to the optimal set of weights. The accuracy is comparable to that of the current state-of-the-art algorithms used for sentiment classification on mobile reviews. In [15], the overfitting problem is discussed and a solution using SVM classifiers is proposed. Overfitting cannot be removed entirely but can be reduced to some extent. The proposed solution reduces the time cost and increases the accuracy and performance of the system.
3. METHODOLOGY FOR SENTIMENT ANALYSIS

Machine learning for sentiment analysis follows three phases: [3-15]

Phase 1 - Training Phase: Here the training data is used to train the model by pairing the given input with the expected output.

Phase 2 - Validation Phase: This phase measures the goodness of the learning model that has been trained and estimates model properties such as error measures, sensitivity, specificity, precision, recall and others. It uses a validation dataset, and its output is a more refined learning model.

Phase 3 - Application Phase: In this phase the model is applied to real-world data for which results need to be derived.

[...] data such as profile data and tweet messages. The former is static while the latter is dynamic. Tweets could be text, images, videos, URLs or spam tweets.

Data in the machine learning context can be either labelled or unlabelled. [17]

Unlabelled data: It consists of samples of natural or human-created artifacts that can easily be obtained from the world. Examples of unlabelled data include photos, audio recordings, videos, news articles, tweets, etc.

Labelled data: It takes a set of unlabelled data and augments each piece of it with some sort of meaningful “tag”, “label” or “class” that is informative or desirable to know.
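The three phases above can be illustrated with a minimal sketch. It is not the classifier studied in this paper: the toy tweets, the fixed train/validation split and the word-count voting model are all assumptions made only to show where training, validation and application sit relative to each other.

```python
from collections import Counter

# Hypothetical labelled tweets, invented for illustration only.
LABELLED = [
    ("love this phone", "positive"),
    ("great service and great price", "positive"),
    ("really happy with the brand", "positive"),
    ("good quality love it", "positive"),
    ("hate the new update", "negative"),
    ("terrible support very bad", "negative"),
    ("bad battery and poor screen", "negative"),
    ("poor quality hate it", "negative"),
]

def train(examples):
    """Phase 1: count how often each word appears under each label."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.split())
    return counts

def predict(model, text):
    """Score each label by summed word counts; ties go to the first label seen."""
    words = text.split()
    scores = {label: sum(c[w] for w in words) for label, c in model.items()}
    return max(scores, key=scores.get)

def accuracy(model, examples):
    """Phase 2: fraction of held-out examples predicted correctly."""
    hits = sum(predict(model, text) == label for text, label in examples)
    return hits / len(examples)

# Fixed split (an assumption): three tweets per class for training,
# one per class held out for validation.
train_set = LABELLED[:3] + LABELLED[4:7]
validation_set = [LABELLED[3], LABELLED[7]]

model = train(train_set)                                # Phase 1
validation_accuracy = accuracy(model, validation_set)   # Phase 2
new_prediction = predict(model, "love the great quality")  # Phase 3
```

In practice the validation score would be used to compare or tune candidate models before any of them is applied to unseen tweets.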
To train the classifiers we need to fetch the classified tweets. Pre-processing makes it easier to extract information from the text and to apply the machine learning algorithms.

Y = f(X)    (1)

Supervised learning problems can be further grouped into regression and classification problems.
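As a rough illustration of the pre-processing step mentioned above, the sketch below normalises a raw tweet into word tokens. The specific cleaning rules (lowercasing, dropping URLs and @mentions, keeping the hashtag word, stripping digits and punctuation) and the function name `clean_tweet` are assumptions chosen for illustration; the paper does not list its exact rules.

```python
import re

def clean_tweet(text):
    """Normalise a raw tweet into lowercase word tokens (illustrative rules only)."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)                   # drop @mentions
    text = re.sub(r"#(\w+)", r"\1", text)               # keep the hashtag word, drop '#'
    text = re.sub(r"[^a-z\s]", " ", text)               # drop digits and punctuation
    return text.split()

# Hypothetical example tweet.
tokens = clean_tweet("Loving the new @BrandX phone!! #happy http://t.co/abc123")
```

The resulting token list is the kind of input a bag-of-words feature extractor would then turn into numeric features for a classifier.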
The “K” in the KNN algorithm is the number of nearest neighbors we wish to take a vote from [11].

Load the data.
Initialize the value of k.
To get the predicted class, iterate from 1 to the total number of training data points.
Calculate the distance between the test data and each row of training data. Here we use Euclidean distance as our distance metric:
$d(a, b) = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2}$    (3)
where $a = (a_1, \dots, a_N)$ is the test point and $b = (b_1, \dots, b_N)$ is a row of the training data.
Sort the calculated distances in ascending order based on distance values.
Get the top k rows from the sorted array.
Get the most frequent class of these rows.
Return the predicted class.

Pick a misclassified example and select another hyperplane by updating the value, and classify the data.
Repeat the steps.

SVM has a technique called the kernel trick. Kernels are functions that take a low-dimensional input space and transform it into a higher-dimensional space, i.e. they convert a non-separable problem into a separable one. The technique is mostly useful in non-linear separation problems. Simply put, it performs some extremely complex data transformations and then finds a way to separate the data based on the labels or outputs you have defined. When we look at the hyper-plane in the original input space, it looks like a circle.
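The KNN steps listed in this section (compute the Euclidean distance to every training row, sort, take the top k rows, return the most frequent class) can be sketched in a few lines of Python. The function names and the two-dimensional toy features are hypothetical; this is an illustration of the procedure, not the implementation used in the paper.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (Eq. 3)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_rows, train_labels, test_point, k):
    """Predict the class of test_point by majority vote among its k nearest rows."""
    # Distance from the test point to every training row, paired with its label.
    distances = [(euclidean(row, test_point), label)
                 for row, label in zip(train_rows, train_labels)]
    # Sort ascending by distance and keep the labels of the top k rows.
    distances.sort(key=lambda pair: pair[0])
    top_k = [label for _, label in distances[:k]]
    # Most frequent class among the k neighbours.
    return Counter(top_k).most_common(1)[0][0]

# Hypothetical features: (positive-word count, negative-word count) per tweet.
train_rows = [(3, 0), (4, 1), (2, 0), (0, 3), (1, 4), (0, 2)]
train_labels = ["positive", "positive", "positive",
                "negative", "negative", "negative"]

prediction = knn_predict(train_rows, train_labels, (3, 1), k=3)
```

An odd k is a common choice for two-class problems because it avoids tied votes.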
The performance of sentiment classification techniques is estimated using four indicators as follows:

Accuracy => (TP+TN)/(TP+FP+FN+TN)    (4)
Precision => TP/(TP+FP)    (5)
Recall => TP/(TP+FN)    (6)
Specificity => TN/(TN+FP)    (7)

Here TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives. Accuracy is the percentage of predictions that were correct. Recall is the percentage of positive cases that were caught. Precision is the percentage of positive predictions that were correct. Specificity is the percentage of actual negatives that were correctly identified.

4. CONCLUSION

In this paper we have discussed techniques that can be used for Twitter data analysis for brand perception. Once the data is fetched through the Twitter API, it is cleaned and preprocessed using Scikit-learn, which makes the data readable and usable by the machine learning classifiers. The data is then designated as training data and passed through the classifiers to train the models and provide relevant results such as accuracy, precision, recall and specificity.

REFERENCES

[1] Bing Liu, "Sentiment Analysis and Opinion Mining", Morgan & Claypool Publishers, May 2012.
[2] Bing Liu, "Sentiment Analysis and Subjectivity", Invited Chapter for the Handbook of Natural Language Processing, Second Edition, March 2010.
[3] Mitali Desai, Mayuri A. Mehta, "Techniques for Sentiment Analysis of Twitter Data: A Comprehensive Survey", in International Conference on Computing, Communication and Automation (ICCCA). DOI: 10.1109/CCAA.2016.7813707
[4] Ravikiran Janardhana, "How to build a twitter sentiment analyzer?", May 8, 2012. [Online]. Available: https://ravikiranj.net/posts/2012/code/how-build-twitter-sentiment-analyzer/#id10
[5] "Learning to Classify Text". [Online]. Available: http://www.nltk.org/book/ch06.html
[6] Kiruthika M., Priyanka Giri, Sanjana Woonna, "Sentiment Analysis of Twitter Data", in International Journal of Innovations in Engineering and Technology (IJIET), Volume 6, Issue 4, April 2016.
[7] Shubham Goyal, "Review Paper on Sentiment Analysis of Twitter Data Using Text Mining and Hybrid Classification Approach", International Journal of Engineering Development and Research, Volume 5, Issue 2, 2017.
[8] Usman Malik, "Text Classification with Python and Scikit-Learn", August 27, 2018. [Online]. Available: https://stackabuse.com/text-classification-with-python-and-scikit-learn/
[9] "sklearn.feature_extraction.DictVectorizer". [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer
[10] Yang Liu, "Encoding Categorical Features", Sep 13, 2018. [Online]. Available: https://towardsdatascience.com/encoding-categorical-features-21a2651a065c
[11] Tavish Srivastava, "Introduction to k-Nearest Neighbors: Simplified", March 26, 2018. [Online]. Available: https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/
[12] Sunil Ray, "Understanding Support Vector Machine algorithm from examples (along with code)", September 13, 2017. [Online]. Available: https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
[13] Shubham Panchal, "k Nearest Neighbor Classifier ( kNN )-Machine Learning Algorithms", Mar 9, 2018. [Online]. Available: https://medium.com/@equipintelligence/k-nearest-neighbor-classifier-knn-machine-learning-algorithms-ed62feb86582
[14] Aron Culotta, Jennifer Cutler, "Mining Brand Perceptions from Twitter Social Networks", published in Marketing Science, Articles in Advance, 22 Feb 2016.
[15] Anjali Mahavar, Priya Pati, Abhishek Tripathi, "A Survey Paper on Twitter Sentiment Analysis of Current Affairs", in International Journal for Scientific Research & Development, Vol. 4, Issue 10, 2016.
[16] Onam Bharti, Mrs. Monika Malhotra, "SENTIMENT ANALYSIS ON TWITTER DATA", in International Journal of Computer Science and Mobile Computing, Vol. 5, Issue 6, June 2016.