
Proceedings of the 11th INDIACom; INDIACom-2017; IEEE Conference ID: 40353

2017 4th International Conference on “Computing for Sustainable Global Development”, 01st - 03rd March, 2017

Bharati Vidyapeeth's Institute of Computer Applications and Management (BVICAM), New Delhi (INDIA)

A study on live Sentiment Analysis using Twitter data


Devesh Agrawal, Department of Computer Science, BVCOE, New Delhi-110063, Email ID: dagrawal1096@gmail.com
Himesh Kulshrestha, Department of Computer Science, BVCOE, New Delhi-110063, Email ID: himeshkulshrestha@gmail.com
Akansha Tanwar, Department of Computer Science, BVCOE, New Delhi-110063, Email ID: akanshatanwar14@gmail.com
Chirag Bansal, Department of Computer Science, BVCOE, New Delhi-110063, Email ID: bansalchirag1994@gmail.com
Saurabh Chawla, Department of Computer Science, BVCOE, New Delhi-110063, Email ID: saurabh.chawla2308@yahoo.in

Abstract—In recent times, the number of social networking and blogging sites has grown at a tremendous pace, and so has the number of their users. These sites provide a platform where users can share their opinions and sentiments about topics in different fields such as movies and politics. Twitter is one such popular microblogging site, where users share their opinions, sentiments, emotions and more in the form of short texts and images. It has a large outreach with millions of users and is thus an outstanding source for opinion collection and sentiment analysis. In this paper, we perform sentiment analysis by collecting data from Twitter using its API and applying natural language processing and machine learning algorithms such as the Naïve Bayes algorithm and its variants, Logistic Regression and Support Vector Clustering. We also try to identify the challenges and problems associated with these existing algorithms.

Keywords—Sentiment Analysis, Opinion Mining, Twitter, classifier, Naïve Bayes, Logistic Regression, Support Vector Clustering

I. INTRODUCTION
Every person has opinions and sentiments and takes them into account while making any important decision. These opinions and sentiments about a particular thing depend on many factors, including natural instincts and personality, but more than that they depend on the opinions of other people about that thing. For example, when deciding whether to watch a particular movie, we rely on other people’s reviews of that movie more than anything else, and the larger the number of reviewers, the more confident we become about whether the movie is good or bad. Social media sites like Twitter, as microblogging services, give their users a place where they can check, collect and see what other people feel about a product, or even find people with sentiments different from or similar to their own. This helps people make decisions. This type of decision making is not restricted to individuals; it is rapidly extending to widely divergent fields such as business operations, stocks, marketing, sales, and political or even moral issues. More and more companies are turning to social media to gather popular opinion about their products and then customize those products according to the reviews, so that they can give people what they want. Large political houses are more active in online promotion than in offline campaigning. More and more researchers are trying to dive deep into the subject of opinion mining, or sentiment analysis.
Sentiment Analysis, or Opinion Mining, involves finding relevant information in source material using techniques such as Natural Language Processing and Machine Learning. It is usually aimed at finding out what the speaker or writer meant while saying or writing a sentence.
Twitter currently has about 317 million monthly active users, so it serves as one of the most favored, reliable and easy-to-use data sources for sentiment analysis. It has a large database, with about 500 million tweets added every day, so there is abundant data to use. Tweets are messages of up to 140 characters that can contain text, emoji and links. Every user has a username, called a handle, and tweets under that handle only. So, when using Twitter data, we are actually using these tweets, as they represent the emotions, opinions, judgements and facts that users share. We access this Twitter data using Tweepy, a Python library for the Twitter API (Application Programming Interface). It allows us to stream live tweets from Twitter and perform live sentiment analysis. This helps us generate real-time data, which can be extremely useful in cases like elections, where a party might want to see the real-time sentiments of people.

II. RELATED WORK
One of the first works in the field of Sentiment Analysis was by Pang et al. [1], who tried to find out whether sentiment analysis can be treated as topic-based categorization (the topics being positive and negative sentiment) or whether a special sentiment classification method has to be developed. They experimented by trying some Machine Learning algorithms on a movie review dataset. While they based their experiment on counting positive and negative words in a document, in other words on document-level classification, other works such as Esuli and Sebastiani, 2006 [2] focused on learning the polarity of words and phrases. For the analysis of Twitter data, one of the major approaches was by Kouloumpis et al. [3], who explored microblogging features including emoticons, abbreviations and the presence of intensifiers such as all-caps
Copy Right © INDIACom-2017; ISSN 0973-7529; ISBN 978-93-80544-24-3 6053

and character repetitions for Twitter sentiment classification. They experienced a great success rate when they worked with lexicons tagged with their prior polarity, but when they included Part of Speech (POS) features in their experiments the results dropped. Barbosa and Feng (2010) [4] performed sentiment analysis in two phases. First, they classified tweets as objective or subjective, and then the subjective tweets were classified as positive or negative. They included retweets, hashtags, links, punctuation and exclamation marks in conjunction with features like the prior polarity of words and POS in their datasets. Turney [5] used the bag-of-words method, which treats words as separate units with no relationship between any two words. For sentiment analysis, the sentiment of every word was determined and the values were combined using aggregation functions to reach a result. Observing this, we came to the conclusion that POS, lexicons and features like hashtags and links are the most important features for sentiment analysis. Our approach does not involve making a dictionary out of words or emoticons as done by Agarwal et al. [6]; rather, we classify positive and negative words using the corpus provided by the Natural Language Toolkit (a Python module).

III. OUR PROCESS
Fig. 1. This figure depicts an overview of the process used for project implementation.

The detailed steps are as follows:
A. Extraction
Tweepy, which we used for data gathering, provides the data in JSON (JavaScript Object Notation) format, a dictionary-like data format made up of key and value pairs. We obtain all the data we want from the value part of the data and not the key part. This gives us the raw tweets, meaning they contain all the slang, abbreviations, emoji, links and other irrelevant content.

B. Preprocessing
We used the Natural Language Toolkit (NLTK) module for Python. This module comes with a large set of varied corpora (bodies of text), ranging from the Bible to political speeches. We create two sets of reviews, one positive and the other negative, using one of the corpora.
We use these documents to create a word list as follows:
1. We append a ‘pos’ label to all the words present in the positive dataset and a ‘neg’ label to all the words present in the negative dataset. This will later serve as the classification mechanism (because we created two categories) for our Machine Learning algorithms.
2. We use part-of-speech tagging to identify all the adjectives and adverbs in both datasets.
3. We then create a word list by merging the positive words and the negative words.
The review files we used had a lot of reviews (~10000), and each review contained a lot of words (~400), so there are bound to be many words that are used less often than others. These words are also used less often in everyday language; for example, a person is more likely to use a word like ‘great’ than ‘insuperable’. We identify these words using the FreqDist class of the NLTK module, which creates a dictionary-like object with the word as key and its frequency as the value. Next, we keep the 5000 most commonly used words. This gives us a balanced dataset of the most commonly used positive and negative words.
After finishing the preprocessing of these words, we are left with cleansing and normalizing our tweets. We remove all the hashtags, emoji and unreadable characters from each tweet. We also remove links, as they do not count in the sentiment analysis process. This gives us a relatively clean and normalized database.

C. Classification
After creating a clean database, we used Machine Learning algorithms on it and measured how they fare.
Scikit-learn (or sklearn) is a Python module that provides prebuilt implementations of machine learning algorithms, which we can use as functions to classify our documents.
We use our positive and negative review documents as training datasets for the algorithms and use similar documents as our testing set, with which we measure the accuracy of each algorithm.
When an algorithm is trained on a training dataset, a classifier is created. We can save this classifier using pickling so that we do not need to train it again.
We can then use this classifier on our Twitter dataset for our final results.
We have used different algorithms, several of which are based on the same premise: the Naïve Bayes algorithm, the Bernoulli Naïve Bayes algorithm and the Multinomial Naïve Bayes algorithm.
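The extraction and preprocessing steps described above can be illustrated with a minimal sketch. It is not the code used in this work: it relies only on the Python standard library, with collections.Counter standing in for NLTK's FreqDist, and the sample payload, the function names and the regular expressions are assumptions chosen for illustration. The part-of-speech filtering step is omitted.

```python
import json
import re
from collections import Counter

def extract_text(payload):
    """Return the value stored under the "text" key of a JSON tweet payload."""
    return json.loads(payload)["text"]

def clean_tweet(text):
    """Remove links, hashtags and non-ASCII characters such as emoji."""
    text = re.sub(r"https?://\S+", " ", text)               # links
    text = re.sub(r"#\w+", " ", text)                       # hashtags
    text = text.encode("ascii", "ignore").decode("ascii")   # emoji, unreadable characters
    return re.sub(r"\s+", " ", text).strip()                # collapse whitespace

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def most_common_words(documents, n):
    """Return the n most frequent words across all documents (FreqDist stand-in)."""
    counts = Counter(word for doc in documents for word in tokenize(doc))
    return [word for word, _ in counts.most_common(n)]

def find_features(text, word_list):
    """Map each word in word_list to True or False depending on its presence in text."""
    present = set(tokenize(text))
    return {word: word in present for word in word_list}

# Hypothetical payload; a real tweet object carries many more keys.
payload = json.dumps({"user": {"screen_name": "someone"},
                      "text": "Loved the movie!!! #great http://t.co/xyz \U0001F600"})
tweet = clean_tweet(extract_text(payload))

# Toy review documents; the paper keeps the 5000 most common words instead of 2.
word_list = most_common_words(["great great great movie movie", "bad"], 2)

print(tweet)
print(find_features(tweet, word_list))
```

The feature dictionary returned by find_features has one boolean entry per word in the word list, which is the shape of input that NLTK-style classifiers typically accept.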
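The train, pickle and reuse workflow of the Classification step can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this work: it is a from-scratch multinomial Naïve Bayes with add-one smoothing rather than the NLTK and scikit-learn classifiers, the four training sentences are made up, and the model is pickled to an in-memory byte string instead of a file.

```python
import math
import pickle
from collections import Counter

def train(documents, labels):
    """Collect per-class word counts and document counts from labelled text."""
    word_counts = {label: Counter() for label in set(labels)}
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.lower().split())
    vocabulary = set().union(*word_counts.values())
    return {"word_counts": word_counts,
            "doc_counts": Counter(labels),
            "vocabulary": vocabulary}

def predict(model, document):
    """Return the label with the highest log-probability for the document."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    scores = {}
    for label, counts in model["word_counts"].items():
        total_words = sum(counts.values())
        score = math.log(model["doc_counts"][label] / total_docs)  # class prior
        for word in document.lower().split():
            if word in model["vocabulary"]:
                # Add-one (Laplace) smoothing avoids log(0) for words
                # that never occurred in this class.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data standing in for the positive and negative reviews.
documents = ["great film loved it", "wonderful great acting",
             "terrible boring film", "awful plot hated it"]
labels = ["pos", "pos", "neg", "neg"]

model = train(documents, labels)

# Pickle once, reload later; pickle.dump(model, file) would write to disk instead.
saved = pickle.dumps(model)
restored = pickle.loads(saved)

print(predict(restored, "great acting"))
print(predict(restored, "boring plot"))
```

Because the restored object carries all the learned counts, the training data is not needed again when new tweets are classified.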

1) Naïve Bayes Algorithm: It uses Bayes’ theorem, and its most prominent feature is that it assumes independence between the features of a variable. For instance, if a particular thing has 5 different features, this algorithm will assume that they are all independent of each other, although there might be correlation between them. The Naïve Bayes classifier is a very efficient, scalable and favored form of classification, primarily because of its low dependence on computational power, and it can also work with a small amount of training data.

2) Multinomial Naïve Bayes Algorithm: This algorithm is used for the analysis of multinomially distributed data. Here, the training samples represent the frequencies associated with the events. Therefore, it takes into account how many times each feature (here, each word) occurs, including multiple occurrences.

3) Bernoulli Naïve Bayes Algorithm: It is used for datasets which are distributed according to multivariate Bernoulli distributions. It treats each feature as Boolean (1 or 0). This differs from the multinomial variant because it takes into account non-occurring features: if a feature is missing, the Bernoulli model penalizes it, unlike the Multinomial model. It also does not count multiple occurrences of words, which can result in more errors and thus lower accuracy.

4) Logistic Regression: It is a predictive, statistical method of analysis used when the dependent variable is binary. It is primarily used for fitting a model that describes the relation between a binary variable and some independent variables. It uses the logistic cumulative distribution function for this purpose. It is very useful for prediction because it forces the output to lie between 0 and 1, which allows the output to be read as a probability.

5) Support Vector Clustering: Clustering deals with data organization [10]. It divides data into groups so that the data is more organized and useful. Clustering can be performed in many different ways. One is hierarchical clustering, where groups are divided according to their similarity. Another is Support Vector Clustering, which identifies regions of low data density in the dataset, called “valleys”; these “valleys” are then treated as boundaries partitioning the main data-containing regions. A kernel function, which may take different parameters, is used for this purpose to create a map linking the data to a high-dimensional feature space. In this feature space a sphere enclosing the data is found; when mapped back to the data space, the sphere gives a set of closely bound contours. Each region enclosed by a contour is then named a cluster.

D. Testing
We then create a classifier of our own. This classifier runs the testing data on all the pickled classifiers, checks which classifier gave the most accurate result and returns the accuracy of that classifier.
After this, we performed the final test on the Twitter data we wanted to analyze. We create one script containing the code to access the pickles and our classifier, and another one to live stream tweets from the website. Sentiment analysis on these tweets is performed by calling the first script from the streaming script.
Here are some of the results (Figure 2) we got after running the script:

Fig. 2. This picture shows the results obtained on executing the final program on a live twitter feed.

Fig. 3. This picture is a snippet of the live program execution.

In the results, the tweet text is presented first, then the keyword ‘pos’ for a positive tweet or ‘neg’ for a negative tweet, followed by the confidence measure of the analyzing algorithm.
After this, we plotted the results on a graph for better visualization. The y-axis shows the positivity or negativity of the tweet and the x-axis shows the tweets, so every point on the graph represents a tweet with its

sentiment value. We created a live streaming graph using matplotlib (a Python module).

Fig. 4. Using Tweepy, we can search for tweets which all contain a particular keyword; in this particular example we used the term ‘Modi’.

IV. RESULT
The experiment we conducted was to analyze how different algorithms fare when they are put against a new type of word list and a live Twitter stream. We checked multiple algorithms, from Logistic Regression and Support Vector Clustering to the older Naïve Bayes and its variants.
The results, along with the accuracy obtained for the different algorithms, are given below:

Table 1: Results Obtained.
Algorithm Applied            Accuracy Obtained
Naïve Bayes                  65-70%
Bernoulli Naïve Bayes        67-69%
Multinomial Naïve Bayes      70-75%
Logistic Regression          65-68%
Support Vector Clustering    60-66%

As we can see, the results obtained using the Multinomial Naïve Bayes algorithm are the best in accuracy, because this algorithm was primarily built for text classification and tasks like sentiment analysis. It takes into account the frequencies of the words used, which is not seen with the other algorithms.
The poor accuracy of Support Vector Clustering explains why it is not preferred for sentiment analysis. Because we created our database from one dataset and then used it on a totally different dataset, this method tends to overfit the model: instead of better accuracy, its main focus is on covering all the points of the dataset.
Naïve Bayes, the oldest of these techniques, performs reasonably well, partly because it is a supervised method of learning. It nevertheless underperforms in comparison to its derivatives, Bernoulli Naïve Bayes and Multinomial Naïve Bayes.
Logistic Regression is more commonly used when the data points are large in number. Here we have only a small collection of data for each of the positive and negative categories, which hampers its accuracy. It is more stable when the sample size is large, according to [1-14].

CONCLUSION
Sentiment Analysis is rapidly gaining momentum as one of the leading technologies in the emerging world. This paper tried to find out how it would fare in a real-life scenario, for example how people react to different points of view during a live debate in a presidential election.
We used algorithms such as Support Vector Clustering and Logistic Regression, but after studying previous works [1-9] we found that they are not much better than Naïve Bayes. Thus, it can be concluded that a better algorithm for sentiment analysis is the need of the hour.

V. FUTURE SCOPE
Different data extraction techniques can be utilized. Apart from Twitter, multiple data sources and data filtering techniques can be used to find relevant results.
We have not matched the context of the training data to that of the tweets. For example, in the movie review dataset we use for training, the term ‘hit’ can be used for a movie which was very successful, but in the context of tweets a person might be using ‘hit’ in a negative sense, such as describing an accident. We aim to remove this limitation.

REFERENCES
[1] Bo Pang, Lillian Lee, Shivakumar V., “Thumbs up? Sentiment Classification using Machine Learning Techniques”, Proceedings of EMNLP 2002, pp. 79–86.
[2] A. Esuli and F. Sebastiani, “SentiWordNet: A publicly available lexical resource for opinion mining”, Proceedings of LREC, 2006.
[3] Efthymios Kouloumpis, Theresa W., Johanna Moore, “Twitter Sentiment Analysis: The Good the Bad and the OMG!”, Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media.
[4] L. Barbosa and J. Feng, “Robust sentiment detection on Twitter from biased and noisy data”, Proc. of Coling, 2010.
[5] P. D. Turney, “Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews”, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 417–424, Association for Computational Linguistics, 2002.
[6] Vishal A.K. and S.S. Sonawane, “Sentiment Analysis of Twitter Data: A Survey of Techniques”, International Journal of Computer Applications (0975 – 8887), 2016.
[7] Yorick Wilks and Mark Stevenson, “The grammar of sense: Using part-of-speech tags as a first step in semantic disambiguation”, Journal of Natural Language Engineering, 4(2):135–14, 1998.
[8] DongSung Kim and Jong Woo Kim, “Public Opinion Mining on Social Media: A Case Study of Twitter Opinion on Nuclear Power”, Advanced Science and Technology Letters, Vol. 51 (CESCUBE 2014), pp. 224-228.
[9] G. Vinodhini, RM. Chandrasekaran, “Sentiment Analysis and Opinion Mining: A Survey”, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 6, June 2012.

[10] Ben Hur, David Horn, Hava T. Siegelman, Vladimir Vapnik, “Support Vector Clustering”, Journal of Machine Learning Research 2, pp. 125-137, 2001.
[11] B. Liu and L. Zhang, “A survey of opinion mining and sentiment analysis”, Mining Text Data, Springer US, pp. 415-463, 2012.
[12] Maite Taboada, J. Brooke, M. Tofiloski, K. Voll, M. Stede, “Lexicon-based methods for sentiment analysis”, Computational Linguistics 37.2, 2012.
[13] Agarwal, Xie, Vovsha, Rambow, Passonneau, “Sentiment analysis of Twitter data”, Proceedings of the Workshop on Language in Social Media, 2011.
[14] Camara, Valdivia, Lopez, Raez, “Sentiment Analysis in Twitter”, Cambridge University Press, 2014.
