
Technical Seminar Report, Academic Session 2016-17, AICE, Jaipur

Chapter 1

INTRODUCTION

This chapter introduces the study of sentiment analysis. It discusses the background of the
study, the impact of social networking sites on today's world, social networking sites as a
virtual friend, sentiment analysis of tweets, and the objective of the study. The scope of
the study, its relevance, and its overall contribution to knowledge are also discussed.
Finally, the chapter ends with an overview of the other chapters of the thesis.

1.1 BACKGROUND OF THE STUDY

Social media is revolutionizing the use of the Internet. This is felt across the world, even
by those who do not actively use social media. The impact of social media, from blogs to
social networking sites, is felt wherever Internet connections are available. Social media
and social sites connect people from different regions and allow them to share their
experiences freely, thus creating a world of their own. Users are active on Twitter and
other social media in China, South Korea, India, Latin America, Europe, and the United
States, and the active participation rate is increasing rapidly. Nearly 700 million people
are active on Facebook, Twitter, and Tumblr, creating the third most populous virtual
country. As a result, people come to know what is happening in any part of the world
instantly, well before reading about it in the newspapers the next day. Twitter has been
called the pulse of the planet and is used by many celebrities, country presidents,
political parties, foreign revolutionaries, journalists, and geeks. The Pope himself uses
social media to communicate his messages to the world. These days, social media is used
extensively both to overthrow governments and to get them elected.

1 Prepared By: Shubham Singhal, AICE



1.2 IMPACT OF SOCIAL NETWORKING SITES

The use of social networking sites such as Facebook, Twitter, and YouTube has emerged as
an important electoral campaigning tool in many countries, creating awareness about
development and the economic situation and thereby contributing to change in the political
system. In the 2012 US election, President Obama's campaign used social media such as
Facebook, Twitter, and YouTube, along with other technologies such as podcasts and text
messaging. South Africa, Taiwan, Tunisia, and Turkey all saw social media used to bring
about major change in their political systems. The same is true of the recent election in
India, where Prime Minister Modi, his supporters, and other activists used social media
extensively to campaign. Understanding the electoral power of social media could prove
very rewarding for political parties, as evidenced by the 2014 national elections of India.

Social media also provides a platform for interaction between customers and marketers.
These channels use the Internet as their backbone and Web 2.0 technologies to transform
monologue messages into many-to-many social media dialogues. Social media are highly
accessible and can reach a large audience within a short span of time. Many researchers
describe social media as a group of Internet-based applications that build on the
ideological and technological foundations of Web 2.0 and that allow the creation and
exchange of user-generated content. The real power of social media is that it is social:
everyone is connected. The rapid growth of online communities on Twitter and Facebook
encourages consumers and marketers to engage in brand discussions, sales and promotion
discussions, and other consumer discussions on the web. A search on a product displays
user content and users' opinions.

Social media has lately emerged as an important tool in marketing and brand building. A
social media marketing program focuses on understanding customers, building relationships,
and promoting the brand. Social media, with the aid of Web 2.0 technologies, has
transformed the relationship between customers and sellers. More than ever before,
customers are able to interact with other customers and with companies through social
media. Unlike traditional marketing, in which sellers sell products through advertisements
and promotions and control purchase decisions, social media enables the customer to
participate by sharing their experience with others, thereby influencing others' purchase
decisions. Customers on social media are able to converse, share, and influence other
customers. Companies can also take advantage of social media by actively listening to what
customers are saying and formulating strategies to meet customer expectations. Currently,
the most popular social media sites are Facebook, Twitter, YouTube, and LinkedIn. Each of
these tools has a different focus. Twitter can be a powerful marketing tool if used
regularly: since the messages are short, a business that tweets regularly lets potential
customers learn about it and follow the company for updates. The growing number of users
on Twitter and Facebook is giving companies exposure and influencing a huge number of
consumers. Social media has changed the focus of marketing from a supplier perspective to
a customer perspective. Customers, not companies, now control the flow of marketing
information, as they are able to share information on the social web. The Web is a virtual
environment where customers are able to experience products before buying. As a result,
many companies are adopting social media as part of their marketing strategy to gain
business value by creating brand awareness, building reputation, and improving customer
satisfaction and retention.

1.3 SOCIAL NETWORKING SITES LIKE A VIRTUAL FRIEND

Electronic word-of-mouth (e-WOM) has profoundly changed the way information is transmitted
across the globe and has transcended the traditional media of marketing and communication.
Many studies have shown the importance of e-WOM as a widely accepted form of communication
in marketing. Online product reviews, feedback, and opinions provide valuable information
not just for other consumers but also for companies, which can monitor consumer attitudes
towards their products and act on the feedback effectively. Studies have also shown that
WOM communication affects consumer purchasing decisions. Consumer-created information is
helpful for purchase decisions because it gives the consumer an idea of others'
experiences. There is evidence that consumer reviews have become important for product
sales. Opinions are key influencers of our behavior. Our beliefs and perceptions of reality
are conditioned by how others see the world. Whenever we need to make decisions, we often
seek the opinions of others. In the past, individuals sought opinions from friends and
family, and organizations used surveys, focus groups, opinion polls, and so on. With more
than 700 million people using online social media such as Facebook and Twitter, companies
are using this as an opportunity to reach people. Today, a huge amount of information is
available on the web. It is rich in content and is a useful source of marketing
intelligence for mining opinions, views, moods, and attitudes: for example, whether a
product is good or bad, whether a service experience is positive or negative, or how the
public responds to an event or political movement.

1.4 SENTIMENT ANALYSIS

Sentiment analysis (also known as opinion mining) is the process of detecting the sentiments
expressed in a given text. These sentiments can be classified as positive, negative, or
other. They can be found in the feedback, reviews, or critiques provided by customers in
different forums. Sentiment analysis gives companies an estimate of how their products have
been accepted in the market and helps them determine how to improve product quality to
satisfy existing customers and attract new ones. The aim of sentiment analysis is to
determine the attitude of an author with respect to a given topic. Under conventional
circumstances, it is very difficult to find out why a consumer did not buy a product, but
with the help of sentiment analysis tools it becomes easier to find the reasons behind a
customer not buying the product. Apart from products and marketing, sentiment analysis has
been found useful in areas such as politics, sociology, and psychology. Academicians have
recognized the importance of online sentiments and their impact on marketing. A number of
research studies on the sentiments of web data have emerged since 2002. Similarly, there
are a number of studies in the area of Hollywood movies and the factors influencing their
success. Some studies have shown the influence of online product reviews on consumer
purchase decisions [7]. They have looked at the relationship between box office revenues
and online movie reviews. Although these studies have compellingly established the
significance of online reviews as an influencer of box office revenues, they do not offer a
concrete model that companies can use in their decision-making mechanisms.


1.5 SENTIMENT ANALYSIS OF TWEETS

Tweets are a relatively new type of e-WOM. Twitter is a micro-blogging site founded in
2006. It allows people to post their thoughts as text using just 140 characters. These
posts, popularly known as tweets, may include text and URLs. Today, Twitter has more than
300 million users. A billion tweets are sent every three days, reaching millions of people.
People use Twitter to track news and to find out what others are talking about: the latest
in politics, technology, events around their city, the latest phone or gadget, jobs in
their area, and the latest movies. Twitter allows the latest information, news, and ideas
to be shared, and suggestions to be solicited, instantaneously across the globe as never
before. Some users are just active listeners, while others actively participate and
exchange information. Tweets may also contain people's opinions, views, or experiences
relating to a product, service, or brand, and these tweets are available to the public.
Businesses, academicians, and researchers therefore have a huge amount of Twitter data to
analyse.

Twitter, if used effectively, can be an extremely useful tool for businesses to attract
customers, increase traffic, and generate more leads. Because of Twitter's ability to reach
a large number of users with a short message, it is the most popular micro-blogging site
used today for public relations, advertising, and marketing campaigns. Businesses can build
a strong customer base easily and influence millions of people instantaneously. They can
use the Twitter channel to listen to what customers are saying about their products,
brands, services, and marketing campaigns, and about competitors' brands as well. For
businesses, Twitter also provides a rich canvas for broadcasting promotions, services, and
sales, and connects them directly to their customers. Sharing useful information and
maintaining a strong business profile gives a brand credibility on social media, and it can
easily garner thousands of followers who follow the brand's updates. Twitter also offers
promotions and advertisements to get businesses more fans. These days, many businesses have
a Twitter sharing button embedded in their home page to extend their reach. Businesses can
also exert influence by participating directly and understanding customer needs, as both
are part of Twitter. Companies can get feedback on their products from their customers,
which can then be used to improve their products and services. Sentiments on a product may
vary from one place to another or from one country to another, depending on the local
market, local culture, and many other factors. This is useful information for companies
formulating their marketing strategy. Twitter sentiments can play an important role in
customers' purchase decisions. A huge amount of Twitter data can provide the sentiments of
people on products, services, and trends in the market.

Since tweets are short and unstructured, understanding and analysing them is difficult and
challenging. However, Twitter is a good source of information for sentiment analysis for
several reasons: the platform is used extensively by people across the globe; its user base
is large and ranges from regular users to celebrities, politicians, company CEOs, and even
country presidents and prime ministers; and its user base is growing and provides a good
sample of tweets for sentiment analysis.

The movie industry is a business with high revenues. The sector generated 522 billion U.S.
dollars in revenue in 2013. The Hollywood film industry is one of the biggest players in
the overall entertainment sector. A single movie with a very minimal budget can generate
billions of dollars in revenue. Likewise, a movie with a production budget of millions of
dollars can become a colossal failure.

The success or failure of a movie depends on various parameters: actors, directors, launch
dates, theatres, movie reviews, and so on. To assist consumers, a number of online movie
sites provide information such as movie schedules, details of movies, details of actors and
directors, release dates, expert reviews, and user reviews. IMDb.com, boxofficemojo.com,
Yahoo! Movies, RottenTomatoes.com, and many other online sites provide a great deal of
useful information about Hollywood movies. Studies have shown that many moviegoers read
online information and reviews before deciding to watch a movie. When the Steven Spielberg
movie The Terminal was released, it made $19 million on its very first weekend, which was
comparable to other Tom Hanks movies. Eventually, however, the movie flopped at the box
office with a total collection of only $75 million. Reports cited unfavourable online
word-of-mouth from customers, together with online reviews, as the reason for the flop.
When the movie Jack Reacher, starring Tom Cruise, was released in theatres in North America
on 21st December 2012, it did not perform as expected and managed to garner just
$80,070,736 in the US market against a budget of $60 million. Because of its poor
performance, the plan for sequels was scrapped. But the film went on to do well overseas,
generating a box office collection of $138,269,859 and taking its worldwide collection to
$218,340,595. The studio reconsidered its decision not to go ahead with sequels, and there
are now talks of another Jack Reacher film starring Tom Cruise again as the detective. This
example shows that Hollywood movies are watched in many countries besides America, and it
is imperative for Hollywood studios to focus their marketing strategies worldwide. This
study reiterates the point that sentiments from all over the world need attention if a
company is to grow its customer base. Much recent research has used IMDb.com,
boxofficemojo.com, and Yahoo! Movies to study the effect of e-WOM on movie box office
revenues, and the results are mixed. Based on the literature review, it can be concluded
that there are hardly any studies of Twitter sentiments across different parts of the world
on any one product. The choice of studying Twitter sentiments on Hollywood movies across
different parts of the world makes this study unique.

1.6 SUMMARY OF THE CHAPTER


In this chapter, the background of the study of sentiment analysis was discussed first,
followed by the impact of social networking sites on today's world and how they are gaining
popularity day by day. Next, social networking sites as a virtual friend were discussed:
how they provide views about a particular product and offer opinions in the same way that
friends, parents, or experts do. Finally, the chapter gives the organization of the thesis.

Against this background, the objective of this research is to define and create a mechanism
that will help analyse tweets and understand the sentiments expressed by people about a
product, a political issue, or any other subject of review across different parts of the
world. The analysis should be able to provide the pattern or trend of a product in a
particular location or geography and help companies take strategic business decisions.

The purpose of performing sentiment analysis of Twitter is to find out people's opinion
about a certain product, a politician, or any other view users want to express. This will
be very helpful for understanding people's sentiment.


Chapter 2

LITERATURE SURVEY

This chapter provides a brief overview of the work done so far on sentiment analysis. It
also outlines what must be understood in order to understand sentiment analysis, and how
much work has been done on each of those topics.

2.1 TEXT CLASSIFICATION

Text categorization (a.k.a. text classification) is the task of assigning predefined
categories to free text documents. Text classification has applications in many areas such
as spam filtering, email routing, language identification, topic classification and sentiment
classification. Because of the development of electronic and information technologies, the
volume of electronic text files has become too large for people to process manually. It has
brought challenges and opportunities for the development of Natural Language Processing
techniques such as text classification. Text classification techniques can use statistical or
probabilistic algorithms to automatically classify massive electronic text files with
computing technology.

Text classification is also a sub-domain of data classification. However, the text
classification problem has some characteristics that distinguish it from the regular data
classification problem. Most regular data classification applications deal with numeric or
nominal attributes, but text classification applications deal with text data, which
includes letters, words, or phrases. The most common way to apply regular data
classification techniques to text classification is to transform the text data into regular
numeric data and then to run the data classification. For example, we can transform every
word appearing in a text dataset into an attribute and every text document into a vector of
binary values indicating which words occur in the document. However, the dimensionality of
the transformed numeric dataset will still be too large for classification tasks. Even a
small text dataset can contain more than a thousand distinct words, not to mention phrases
and longer n-grams. This problem is called "the curse of dimensionality".
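
The word-to-attribute transformation described above can be sketched as follows. This is a
minimal illustration that assumes whitespace tokenization and lowercasing; the function
names and the three sample documents are hypothetical and are not taken from the report.

```python
def build_vocabulary(documents):
    """Map every distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({token for doc in documents for token in doc.lower().split()})
    return {token: index for index, token in enumerate(tokens)}


def to_binary_vector(document, vocabulary):
    """Return a 0/1 list marking which vocabulary words occur in the document."""
    present = set(document.lower().split())
    # Iterating a dict yields keys in insertion order, i.e. column order.
    return [1 if token in present else 0 for token in vocabulary]


# Hypothetical toy corpus: one attribute (column) per distinct word.
docs = ["good movie", "bad movie", "good good plot"]
vocab = build_vocabulary(docs)
vectors = [to_binary_vector(doc, vocab) for doc in docs]
```

Each new distinct word adds one column, so the vector length grows with the vocabulary of
the whole dataset rather than with the length of a single document.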

Feature selection is a step in text classification. In the feature selection process, we
use feature selection algorithms to select features from the text dataset according to the
text classification goal. By selecting only the features that are useful for the
classification task, the dimensionality of the text classification dataset can be reduced
to a reasonable size.

There are several popular text classification approaches which exhibit efficiency, accuracy
and scalability. They are the Lexicon-based approach, the Naive Bayes approach, the
Bayesian Network approach, the Support Vector Machine (SVM) approach and the
Decision Tree approach.

In data classification, there are two kinds of classification: supervised classification
and unsupervised classification. In supervised classification, pre-labelled data are
provided and classification models are trained on the labelled data. Unsupervised
classification is a classification method which does not need pre-labelled data.

2.1.1 Non-Topic Text Classification

According to its objective, text classification can be divided into topic classification
and non-topic classification. Topic classification is used to classify different text files
into different topic groups. It is used in many real-world applications such as the Google
search engine, auto-recommendation systems and library management. In text data, the topics
of the text files are highly related to the word frequency distribution, and topic
classification applications have shown very good performance with traditional probabilistic
and statistical methods.

Non-topic classification has been developed to classify text files into different groups
based on properties other than topic; examples are genre classification and sentiment
classification. Genre classification classifies text files into different genre groups, for
example as newspaper or research articles. Sentiment classification classifies text files
into different sentiment groups, which are usually positive sentiment, negative sentiment
and neutral sentiment.

2.2 RELATED WORK IN SENTIMENT CLASSIFICATION

Sentiment mining is a division of text mining, which includes information retrieval, lexical
analysis and many other techniques. Many methods widely applied in text mining are
exploited in sentiment mining as well. But the special characteristics of sentiment
expression in language make it very different from standard fact-based textual analysis.
The most important application of opinion mining and sentiment classification has been
customer review mining. Many studies have been reported on different review sites.

Sentiment classification has become a very popular research area in recent years, not only
because it is more difficult than other text classification problems but also because it
has wide applications in the real world. For example, customer review sentiment
classification can be very important to online sales stores such as Amazon.com.

The simplest way to do sentiment classification is the lexicon-based approach, which counts
the positive sentiment words and the negative sentiment words appearing in the text file
and uses these counts to determine the sentiment of the text file. Intuitively, it should
perform well, since people do use sentiment words to express their feelings. However, it
does not work as well as expected, because people do not always express their feelings in
this way. People may use objective words to show sentiment, for example "Air Canada has
seriously tested my patience today." People may also express their complaints in an ironic
way, for example "Thank you Delta for having the rudest employees and almost making me miss
my flight."
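
The lexicon-based approach and its two failure modes can be illustrated with a small
sketch. The word lists below are a deliberately tiny, hypothetical lexicon chosen for
illustration, and the scoring rule (positive count minus negative count) is one common
reading of the approach, not necessarily the exact rule used in the surveyed work.

```python
import re

# Hypothetical toy lexicon; real lexicons contain thousands of entries.
POSITIVE_WORDS = {"good", "great", "love", "thank"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "awful"}


def lexicon_sentiment(text):
    """Classify text by comparing counts of positive and negative lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


objective_wording = "Air Canada has seriously tested my patience today."
ironic_wording = ("Thank you Delta for having the rudest employees "
                  "and almost making me miss my flight.")
```

With this toy lexicon, the first tweet contains no lexicon word at all and is scored as
neutral, while the ironic tweet is scored as positive because of the word "thank". A larger
lexicon would change the individual scores, but neither tweet states its complaint through
the kind of sentiment words a word-counting method relies on.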

Rather than categorizing sentiments into three groups, there have also been works that
categorize sentiment into six groups. This work develops an approach for sentiment
classification of tweets about airline services, which is sentiment classification research
in a specific domain and on a specific platform.

In one survey, a broad view of sentiment classification methods is discussed, including
machine learning techniques and traditional classification methods. Machine learning
techniques have been widely applied in the text classification area, and most of them are
supervised learning classification methods. In supervised learning methods, two datasets
are provided: the training dataset and the test dataset. The training dataset is used to
train the model, a process in which the differentiating characteristics of the documents
are identified. The test dataset is used to validate the performance of the model trained
on the training dataset. Several machine learning sentiment classification methods have
been developed, such as the Naive Bayes (NB) method, the maximum entropy (ME) method, and
the support vector machine (SVM) method. These text classification methods have shown very
good performance in text categorization.
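
The division of labelled data into a training dataset and a test dataset can be sketched as
below. The 25% test fraction, the fixed random seed, and the function names are assumptions
made for illustration; a classifier is taken to be any function from text to label.

```python
import random


def train_test_split(labelled_docs, test_fraction=0.25, seed=0):
    """Shuffle (text, label) pairs and split them into training and test lists."""
    shuffled = list(labelled_docs)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(classifier, test_set):
    """Fraction of test documents whose predicted label matches the true label."""
    if not test_set:
        return 0.0
    correct = sum(classifier(text) == label for text, label in test_set)
    return correct / len(test_set)
```

The model only ever sees the first list during training; the second list is held back so
that the reported accuracy reflects documents the model has not seen.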

The Naive Bayes method has been a very popular method in text categorization because of
its simplicity and efficiency. The theory behind it is Bayes' theorem: the joint
probability of two events can be used to compute the probability of one event given the
occurrence of the other. The key assumption of the Naive Bayes method is that the
attributes used in classification are independent of each other, which considerably reduces
the computational complexity of the classification algorithm.
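
Under the independence assumption, the score of a class is its prior probability multiplied
by the per-word likelihoods, which is what the sketch below computes in log space. It is a
minimal multinomial variant that assumes whitespace tokenization and add-one (Laplace)
smoothing; these choices, the function names, and the four training sentences are
illustrative assumptions rather than details from the surveyed work.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled training data.
TRAINING = [
    ("great flight friendly crew", "positive"),
    ("friendly staff great service", "positive"),
    ("rude staff late flight", "negative"),
    ("late again terrible service", "negative"),
]


def train_naive_bayes(labelled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict_naive_bayes(model, text):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps unseen (word, class) pairs above zero.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAINING)
```

Because each word contributes an independent term, training needs only one pass of counting
and prediction is a sum over the words of the document, which is where the efficiency of
the method comes from.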

The Support Vector Machine (SVM) method has often been considered the best text
classification method. It is a statistical classification approach based on maximizing the
margin between the training instances and the separating hyperplane. The method was
proposed by Vapnik.
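
For linearly separable training data, the margin maximization described above is commonly
written as the optimization problem below. The notation is introduced here for illustration
and does not appear in the report: x_i is the feature vector of document i, y_i is its
label (+1 or -1), and w and b define the hyperplane w . x + b = 0. The margin is 2 / ||w||,
so minimizing ||w|| maximizes the margin.

```latex
\min_{\mathbf{w},\,b}\ \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n
```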


Unlike other machine learning methods, the K-nearest neighbours (KNN) method does not
extract any features from the training dataset but compares the similarity of a document
with its neighbours. For a given document, the KNN classifier finds the k nearest documents
and counts how many of them belong to each class; the document is then assigned to the
class that holds the most neighbours.
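
The neighbour vote can be sketched as follows. The report does not say which similarity
measure is used, so Jaccard similarity over word sets is assumed here; the value k = 3, the
function names, and the sample documents are likewise hypothetical.

```python
from collections import Counter

# Hypothetical labelled documents that serve as the neighbours.
LABELLED = [
    ("great flight friendly crew", "positive"),
    ("friendly staff great service", "positive"),
    ("lovely crew great seats", "positive"),
    ("rude staff late flight", "negative"),
    ("late again terrible service", "negative"),
    ("terrible delay rude crew", "negative"),
]


def jaccard_similarity(text_a, text_b):
    """Shared words divided by all distinct words of the two texts."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def knn_classify(document, labelled_docs, k=3):
    """Assign the class held by most of the k most similar labelled documents."""
    ranked = sorted(
        labelled_docs,
        key=lambda pair: jaccard_similarity(document, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

There is no training step: all the work happens at prediction time, when the new document
is compared with every stored document.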

Many comparative studies have been carried out on different sentiment classification
approaches. One study compared four feature selection approaches and five machine learning
methods on Chinese texts and concluded that the Information Gain algorithm outperforms the
other feature selection approaches and that the Support Vector Machine approach works best
in sentiment classification. Another researcher also found that the Support Vector Machine
approach performs better than the Naive Bayes approach and an N-gram model.

According to a comparative study on feature selection in text categorization by Songbo Tan,
the Information Gain algorithm outperforms other algorithms for feature selection in text
categorization. In that work, the different feature selection algorithms were evaluated by
applying the selected features to a K-Nearest Neighbour (KNN) classification model and a
linear regression model. In our work, we therefore adopted the Information Gain algorithm
to select features for sentiment classification.
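
Information Gain scores a word by how much knowing whether the word is present reduces the
entropy of the class labels. The sketch below uses the standard entropy-based definition
with whitespace tokenization; the function names and the four sample documents are
hypothetical, and the report does not give its own implementation.

```python
import math
from collections import Counter

# Hypothetical labelled documents.
DOCS = [
    ("great flight", "positive"),
    ("great crew", "positive"),
    ("late flight", "negative"),
    ("rude crew", "negative"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(term, labelled_docs):
    """Class entropy minus the entropy remaining after splitting on term presence."""
    all_labels = [label for _, label in labelled_docs]
    with_term = [label for text, label in labelled_docs if term in text.lower().split()]
    without_term = [label for text, label in labelled_docs
                    if term not in text.lower().split()]
    total = len(all_labels)
    remaining = (len(with_term) / total) * entropy(with_term) \
        + (len(without_term) / total) * entropy(without_term)
    return entropy(all_labels) - remaining


def select_features(labelled_docs, k):
    """Return the k vocabulary words with the highest information gain."""
    vocabulary = sorted({t for text, _ in labelled_docs for t in text.lower().split()})
    ranked = sorted(vocabulary, key=lambda t: information_gain(t, labelled_docs),
                    reverse=True)
    return ranked[:k]
```

In the sample data, "great" appears only in positive documents and so receives the maximum
gain of one bit, while "flight" appears equally often in both classes and receives a gain
of zero.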

Prabowo and Thelwall combined rule-based classification and machine learning methods and
proposed a hybrid method. Their method yielded satisfactory results when applied to movie
reviews, product reviews and MySpace comments.

Li, Feng and Xiao used a multi-knowledge based approach in mining movie reviews and
summarizing sentiments, which proved very effective in applications. Ding, Bing and
Philip proposed a holistic lexicon-based approach to classify customers' sentiments towards
certain products and achieved high accuracy. This approach is content dependent and needs
to select feature words and phrases from training data.

Lin and He proposed a probabilistic modelling framework called Joint-sentiment model,
which adopted the unsupervised machine learning method. In their research, they applied
their model to movie reviews and classified the review sentiment polarities.


The ensemble classification approach combines different classification approaches and
classifies documents based on their classification outputs using the majority vote method.
Rui Xia built an ensemble sentiment classification model which integrates two feature sets
and three sentiment classification approaches. He adopted features based on Part-of-Speech
tags and features based on word relations, and the classification methods are the Naive
Bayes method, the Maximum Entropy method and the Support Vector Machine method.
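
The majority vote step can be sketched as below. The three base classifiers are hypothetical
stand-in rules written only to make the example runnable; they are not the Naive Bayes,
Maximum Entropy, and SVM models of the cited work.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted most often (first seen wins a tie)."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_classify(text, classifiers):
    """Ask every base classifier for a label and combine them by majority vote."""
    return majority_vote([classifier(text) for classifier in classifiers])


# Hypothetical stand-in base classifiers, each a function from text to label.
def keyword_rule(text):
    return "positive" if "great" in text.lower() else "negative"


def exclamation_rule(text):
    return "positive" if text.endswith("!") else "negative"


def negation_rule(text):
    return "negative" if "not" in text.lower().split() else "positive"


BASE_CLASSIFIERS = [keyword_rule, exclamation_rule, negation_rule]
```

Using an odd number of base classifiers with two classes avoids ties, which is one reason
three-member ensembles are a common choice.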

2.3 RELATED WORK IN TWITTER SENTIMENT CLASSIFICATION

Depending on the kind of text files it is applied to, sentiment classification can be
divided into many specific application groups, such as movie review sentiment
classification, product review sentiment classification, blog sentiment classification,
and social network sentiment classification.

Movie review and product review sentiment classification apply to reviews or comments on
certain objects and services. Because these sentiment classification techniques can be
applied by many real-world companies such as Amazon, there has been much research work on
review sentiment classification. Blog and social network sentiment classification are
applied to posts that are published on the Internet. Unlike review sentiment
classification, this kind of sentiment classification is not about feedback on certain
products or services but can concern the authors' opinions about anything. Many approaches
have been developed for blog sentiment classification and social network sentiment
classification.

There are many different social network platforms, such as Facebook, Twitter and Instagram.
Each has its own unique characteristics, and different sentiment classification approaches
have been developed for them. For example, Twitter allows users to post no more than 140
characters per post, which makes Twitter sentiment classification different from other
text sentiment classification, because many text files such as blogs are much longer than
140 characters. Many techniques used in text file sentiment classification do not perform
well in Twitter sentiment classification because of this length restriction. For example,
information retrieval and summarization approaches that perform well in paragraph
sentiment classification are not very useful for Twitter sentiment classification, because
there is not much information to retrieve and summarize in order to classify the sentiment
of a tweet. Besides that, traditional and simple classification approaches such as the
Lexicon-based approach also perform better on long text files than on tweets, because
sentiment words are much more likely to appear in long paragraphs than in tweets, which
are limited to 140 characters.

Because Twitter provides public access to its streaming and historical data, it has become a
very popular data source for sentiment analysis and much work has been done in this area.

J.Read used emoticons, such as :-) and :-(, to collect tweets with sentiments and to
categorize them into positive tweets and negative tweet. They adopted Naive Bayes
approach and the Support Vector Machine approach, both of which reached accuracy up to
70% .

In their research, Wilson et al. used hashtags to collect tweets as the training dataset.
They tried to solve the problem of the wide topic range of tweet data and proposed a
universal method to produce a training dataset for any topic in tweets. Besides that,
Wilson et al. also considered three polarities in tweet sentiment classification: positive
sentiment, negative sentiment and neutral sentiment. Unigrams, bigrams and POS features
were taken into account as classification features, and emoticons and other non-textual
features were also considered. Their experiments showed that training data with hashtags
could train better classifiers than regular training data. However, the datasets in their
research were taken from libraries, and they neglected the fact that tweets with hashtags
are only a small part of real world tweet data.

Pak and Paroubek proposed an approach that retrieves sentiment-oriented tweets from the
Twitter API and classifies their sentiment orientation. From the test results, they found
that the classifier using bigram features produces the highest classification accuracy,
because it achieves a good balance between coverage and precision. Their work in tweet
sentiment mining is not domain specific, which means that applying their methods to
domain-specific mining will yield different results. Besides that, the data source is
biased as well, because they retrieved only the tweets with emoticons and neglected all
tweets that did not contain emoticons, which are the majority of tweets. In this work they
also did not consider the existence of neutral sentiment, although classifying neutral
tweets is very important for tweet sentiment analysis.

The challenges in Twitter sentiment classification come not only from the fact that each
post may not exceed 140 characters, but also from the fact that the sentiment of a tweet
can depend heavily on the scenario the user is involved in, while the context of that
scenario is not provided in the tweet. For example, "Cancelled again, it's the fourth time"
can be a tweet with negative sentiment if it is about taking flights, but it can also be a
tweet with neutral sentiment if it is about the user frequently cancelling some
subscriptions. Because of this, Twitter sentiment classification is very domain dependent.

In sentiment classification, features are important because they are the attributes that
determine a text's sentiment. Features can be unigrams, which are single words, or N-grams.
Twitter sentiment classification is domain dependent because those features are domain
dependent, and sentiment features in one domain may not be sentiment features in other
domains at all [29]. For example, in the stock market domain, the word "bear" carries
negative sentiment, since it is a term describing bad performance in the stock market, but
it carries no sentiment at all in most other domains. So the unigram "bear" can be
extracted as a feature in the stock market domain but not in other domains such as airline
services.
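
The domain dependence of unigram features can be illustrated with a small sketch. The
lexicons, their scores and the function name below are hypothetical, chosen only to mirror
the "bear" example; they are not taken from any published lexicon.

```python
# Hypothetical per-domain sentiment lexicons (illustrative values only).
DOMAIN_LEXICONS = {
    "stock_market": {"bear": -1.0, "bull": 1.0},
    "airline": {"delayed": -1.0, "comfortable": 1.0},
}


def unigram_score(word, domain):
    """Return the score of a unigram in a domain, or 0.0 if it is not a feature there."""
    return DOMAIN_LEXICONS.get(domain, {}).get(word.lower(), 0.0)


print(unigram_score("bear", "stock_market"))  # -1.0
print(unigram_score("bear", "airline"))       # 0.0
```

The same word is a sentiment feature in one lexicon and absent from the other, which is
why a classifier trained on one domain may not transfer to another.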

There have been several works on Twitter sentiment classification, and most of them are
not domain dependent. Researchers have been trying to develop approaches that classify
Twitter sentiment in a general way, but have not achieved an outstanding result.

In the paper by Mr. Hasan Saif, the authors explore sentiment analysis of social media.
They introduce a novel approach of adding semantics as additional features into the
training set for sentiment analysis. For each entity extracted from tweets (e.g. iPhone),
they add its semantic concept (e.g. Apple product) as an additional feature, and measure
the correlation of the representative concept with negative/positive sentiment. They apply
this approach to predict sentiment for three different Twitter datasets. Their results
show an average increase of the F harmonic accuracy score for identifying both negative
and positive sentiment of around 6.5% and 4.8% over the baselines of unigrams and
part-of-speech features respectively. They also compare against an approach based on
sentiment-bearing topic analysis, and find that semantic features produce better Recall
and F score when classifying negative sentiment, and better Precision with lower Recall
and F score in positive sentiment classification.

Sentiment analysis has been handled as a Natural Language Processing task at many levels
of granularity. Starting from being a document level classification task, it has been handled
at the sentence level and more recently at the phrase level.

Microblog data like Twitter, on which users post real time reactions to and opinions about
everything, poses newer and different challenges. Some of the early and recent results on
sentiment analysis of Twitter data are by Go et al. and Pak and Paroubek. Go et al. use
distant learning to acquire sentiment data. They use tweets ending in positive emoticons
like :) and :-) as positive and tweets ending in negative emoticons like :( and :-( as
negative. They build models using Naive Bayes, MaxEnt and Support Vector Machines (SVM),
and they report that SVM outperforms the other classifiers. In terms of feature space,
they try a unigram model and a bigram model in conjunction with parts-of-speech (POS)
features. They note that the unigram model outperforms all other models. Specifically,
bigrams and POS features do not help.
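
The emoticon-based labelling described above can be sketched as follows. This is a minimal
illustration, not the code used in the cited work: the function names and the exact
emoticon sets are assumptions.

```python
POSITIVE_EMOTICONS = (":)", ":-)")
NEGATIVE_EMOTICONS = (":(", ":-(")


def distant_label(tweet):
    """Label a tweet by its trailing emoticon; return None when there is none."""
    text = tweet.rstrip()
    if text.endswith(POSITIVE_EMOTICONS):
        return "positive"
    if text.endswith(NEGATIVE_EMOTICONS):
        return "negative"
    return None


def build_training_set(tweets):
    """Keep only labelled tweets and strip the emoticon that served as the label."""
    dataset = []
    for tweet in tweets:
        label = distant_label(tweet)
        if label is None:
            continue  # tweets without a trailing emoticon are discarded
        text = tweet.rstrip()
        for emoticon in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
            if text.endswith(emoticon):
                text = text[: -len(emoticon)].rstrip()
                break
        dataset.append((text, label))
    return dataset
```

Tweets without a trailing emoticon never enter the training set, which is the source of
the bias in emoticon-collected data noted earlier in this section.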

Pak and Paroubek collect data following a similar distant learning paradigm. They perform
a different classification task though: subjective versus objective. For subjective data
they collect the tweets ending with emoticons in the same manner as Go et al. For objective
data they crawl the Twitter accounts of popular newspapers like the New York Times, the
Washington Post etc. They report that POS and bigrams both help (contrary to the results
presented by Go et al.). Both these approaches, however, are primarily based on n-gram
models. Moreover, the data they use for training and testing is collected by search
queries and is therefore biased. In contrast, we present features that achieve a
significant gain over a unigram baseline. In addition we explore a different method of
data representation and report significant improvement over the unigram models. Another
contribution of this paper is that we report results on manually annotated data that does
not suffer from any known biases. Our data is a random sample of streaming tweets, unlike
data collected by using specific queries. The size of our hand-labelled data allows us to
perform cross validation experiments and check for the variance in performance of the
classifier across folds.

2.4 ISSUES & CHALLENGES IN SENTIMENT ANALYSIS OF TWEETS

Sentiment analysis is an essential processing task for personalized services that aim to
exploit textual content, such as micro blog messages and the social tags generated in
social streams, since these usually reflect the users' subjectivity in terms of opinions
and sentiments on certain issues and topics. For this purpose, in addition to the
fundamental sentiment analysis problems such as entity and opinion recognition and
sentiment polarity estimation, there are further aspects that have to be taken into
account. User-generated content in social streams presents a number of interesting
phenomena, namely opinion spam, user reputation, irony, sarcasm, and emotion dynamics. If
we intend to address these issues, we have to go beyond classic text-based opinion mining
techniques.

Opinion spam aims to disturb the normal behavior of social media services, especially
those integrated in recommendation and e-commerce systems, by introducing a bias towards a
specific opinion tendency that promotes or demotes an entity (e.g., a product, a service,
a brand), or that makes users express reviews and opinions in a certain direction. The
identification of opinion spam is a crucial problem for opinion mining and sentiment
analysis approaches, which should be able to detect deceptive opinions that imitate real
user reviews in order to increase or harm an entity's reputation. In certain media, such
as social networks and micro blogging platforms, the users' responses (e.g. unfollowing
contacts and posting complaint comments) may represent a valuable source of information
for detecting spam content.

The writer's reputation is another important aspect of sentiment analysis of
user-generated content. From the point of view of a review site, the higher the reputation
of a review author, the more reliable the review can be to other customers, and sometimes
vice versa: a review that is seen as reliable by the users can give a high reputation to
its author. In this sense, determining the reputation of the authors of a piece of content
can be helpful for opinion spam detection.

Given the subjective nature of user-generated content, another relevant phenomenon is the
existence of irony or sarcasm in the texts. This can constitute a serious problem for many
tasks in sentiment analysis, such as the detection of subjectivity and the classification
of the polarity of a given opinion, since the explicit text content reflects the opposite
of the sentiment really expressed by the writer. Most published work has focused on the
identification of one-liners (jokes or humorous content in short texts), but some research
aims to extract humorous patterns from longer texts. Conversely, there are also approaches
that use the results of sentiment analysis in order to detect humour in texts. For example,
the (negative) polarity of a text has been taken as a feature to retrieve patterns of
humorous content, and syntactic and semantic features have been used as indicators of
humour, e.g., semantic ambiguity, the appearance of emoticons, idioms and slang, and the
abundance or absence of punctuation marks, to name a few. In the case of social media,
certain user responses, such as expressions of amusement and laughing emoticons, may be
used as a source for identifying content with irony and sarcasm, which may be difficult to
detect if no additional information apart from the content itself exists.

2.4.1 Problem Statement:

1. How to extract data from Twitter.

2. How to process the Twitter data to make it readable.

3. Which techniques can be applied to lexicon-based sentiment analysis to improve its
results.

4. How to handle sarcastic tweets so that the resulting sentiment matches the user's
sentiment.

2.4.2 Objective

Sentiment is generally computed at document and/or sentence level. Multiple sentiments,
nonetheless, can be expressed within the same document or the same sentence towards
different targets. For example, the post "I love Nexus4 but I don't like Nexus5 at all!"
expresses two different sentiments towards two different targets, the Nexus4 and Nexus5
devices. Additionally, when monitoring the sentiment of particular brands, events or
individuals in social media, sentiment analysis approaches should consider whether the
posts referencing the brand, event or individual do indeed express sentiment towards those
entities. For instance, a significant number of negative posts in social streams mention
the WWF (the World Wildlife Fund) organization but do not criticize it; they criticize the
negative impact of climate change, the danger of extinction suffered by a number of
species, and other sustainability issues. Furthermore, approaches have emerged in the
sentiment analysis literature in the last few years that aim to identify sentiment targets
within a given text, focusing on entity-level and aspect-level sentiment analysis, i.e.,
they first identify the entities and events appearing in the text, and then check the
sentiment expressed towards them.

2.5 SUMMARY OF THE CHAPTER

In this chapter, text classification was discussed first, that is, how text can be
classified into different predefined categories. Then related work in sentiment
classification was discussed, including different techniques such as Support Vector
Machines and lexicon-based analysis. After that, related work in Twitter sentiment
analysis was discussed, followed by the issues that arise in sentiment analysis of tweets.
Some problems were identified which need to be addressed in the proposed work.


Chapter 3

PROPOSED SYSTEM

After analysing the different techniques for sentiment analysis of Twitter, an improved
lexicon-based technique for sentiment analysis of tweets is proposed.

Here the lexicon-based technique is used only to obtain the sentiment of each word. On top
of it, a new way of computing the sentiment of tweets is proposed in order to overcome the
drawbacks of the lexicon-based technique. The main disadvantage of the lexicon-based
technique is that it is not able to handle sarcastic tweets, so the proposed improvement
attempts to handle such tweets in an efficient manner.

There are four modules in the proposed system:

1. Retrieval Module: This module is used to get the Twitter data from Twitter with the
help of the Twitter API.

2. Pre-processing Module: This module is used to process the Twitter data, i.e.
tokenization, removal of URLs, and counting of emoticons.

3. Scoring Module: This module is used only to fetch the sentiment of each individual word
from the lexicon dictionary.

4. Twitter Sentiment Scoring Module: This module is the heart of the proposed system. Here
some new formulas are proposed so that the sentiment of Twitter data can be calculated.

Details of the modules are given in the next sections.


3.1 RETRIEVAL MODULE

Fig 3: HADOOP ARCHITECTURE WITH FLUME

To retrieve data from Twitter, we first need to create a Twitter API application on
Twitter. After that, the Twitter data can be retrieved with the help of Flume.

The retrieved data is stored in HDFS (Hadoop Distributed File System), from which it can
then be extracted.


Fig 4: DATA FLOW OF TWITTER DATA STREAMING


3.2 PREPROCESSING MODULE

This is the second module in the proposed system architecture. In this module the
following steps are carried out:

1. Remove URLs and #tags from the given tweet.

2. Count the number of capital words in the given tweet.

3. Count the number of exclamation marks in the given tweet.

4. Correct the spelling (if needed).

5. Parse all the emoticons and find the count of each emoticon.

6. Use the form tagger to break the tweet into parts of speech, and keep only the verbs,
adverbs and adjectives.
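
Steps 1, 2, 3 and 5 above can be sketched with regular expressions as follows. This is a
minimal illustration under stated assumptions: the function name, the returned keys and the
short emoticon inventory are hypothetical, a "capital word" is assumed to mean a word of
more than one letter written entirely in upper case, and spelling correction (step 4) and
part-of-speech tagging (step 6) are left out because they need external resources.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Assumed emoticon inventory for this sketch; a real system would use a fuller table.
EMOTICONS = [":-)", ":-(", ":)", ":(", "=)"]
EMOTICON_RE = re.compile("|".join(re.escape(e) for e in EMOTICONS))


def preprocess(tweet):
    """Return simple counts for a tweet after removing URLs and hashtags."""
    text = URL_RE.sub(" ", tweet)          # step 1: remove URLs
    text = HASHTAG_RE.sub(" ", text)       # step 1: remove #tags
    exclamation_marks = text.count("!")    # step 3
    emoticons = Counter(EMOTICON_RE.findall(text))  # step 5
    text = EMOTICON_RE.sub(" ", text)
    words = re.findall(r"[A-Za-z']+", text)
    capital_words = [w for w in words if len(w) > 1 and w.isupper()]  # step 2
    return {
        "word_count": len(words),
        "capital_words": len(capital_words),
        "caps_fraction": len(capital_words) / len(words) if words else 0.0,
        "exclamation_marks": exclamation_marks,
        "emoticons": emoticons,
    }
```

Counting is done after the URLs and hashtags are removed, so a hashtag written in capitals
does not inflate the capital word count.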

Table 1: EMOTICON STRENGTH

Emoticon        Meaning                   Strength
:(              Sad                       -0.5
X-(             Angry                     -1
:), =), :-)     Happy, smile              0.5
:\              Undecided                 0
\m/             Hi 5                      1
</3             Broken                    -0.5
:*              Kiss                      0.5
:|              Straight face             0
BD              Big grin with glasses     1
B(              Sad heart with glasses    -0.5
:'(             Crying                    -1
:D              Big grin                  1
XD              Laughing                  1
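
The emoticon strengths can be held in a lookup table and combined with the emoticon counts
from pre-processing. The sketch below is illustrative: the dictionary and function names
are hypothetical, and each emoticon is assumed to contribute its count multiplied by its
strength.

```python
# Strengths transcribed from Table 1.
# ":'(" is assumed here for the "crying" row.
EMOTICON_STRENGTH = {
    ":(": -0.5, "X-(": -1.0, ":)": 0.5, "=)": 0.5, ":-)": 0.5,
    ":\\": 0.0, "\\m/": 1.0, "</3": -0.5, ":*": 0.5, ":|": 0.0,
    "BD": 1.0, "B(": -0.5, ":'(": -1.0, ":D": 1.0, "XD": 1.0,
}


def emoticon_score(emoticon_counts):
    """Sum count * strength over the emoticons of a tweet; unknown emoticons add 0."""
    return sum(
        count * EMOTICON_STRENGTH.get(emoticon, 0.0)
        for emoticon, count in emoticon_counts.items()
    )


print(emoticon_score({":(": 2}))  # -1.0
```

A tweet containing the sad emoticon twice therefore contributes 2 * (-0.5) = -1.0 from its
emoticons.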


Table 2: VERB/ADVERB STRENGTH

Word        Strength    Word         Strength
Hate        -1          More         0.2
Suck        -0.9        Not          -0.8
Enjoy       0.7         Too          0.6
Excite      0.3         Complete     +1
Relax       0.2         Excite       0.3
Detest      -0.8        Pretty       0.3
Adore       0.9         Very         0.4
Suffer      -0.4        Never        -0.9
Hardly      -1          Much         0.1
Adore       0.9         Any          -0.2
Reject      -0.2        Disgust      -0.3
Less        -0.6        Little       -0.4
Dislike     -0.7        Extremely    0.7

3.3 VERB/ADVERB SCORING MODULE

This module is used to get the sentiment of each word from the word list. For this we use
the following algorithm:

Procedure determine_orientation (target adverb/verb yi, adverb/verb_seedlist)

begin

    if (yi is in adverb/verb_seedlist)

        { yi's orientation = yi's orientation in adverb/verb_seedlist; }

    else if (yi has synonym x in adverb/verb_seedlist)

        { yi's orientation = x's orientation;

          add yi with orientation to adverb/verb_seedlist; }

    else if (yi has antonym z in adverb/verb_seedlist)

        { yi's orientation = opposite of z's orientation;

          add yi with orientation to adverb/verb_seedlist; }

end
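
The procedure can be sketched in Python as follows. This is an illustration rather than
the implementation used in this work: the `synonyms` and `antonyms` dictionaries are
hypothetical stand-ins for a thesaurus such as WordNet, orientations are assumed to be
numeric so that the opposite is the negated value, and returning `None` for an unknown
word is a choice of this sketch, since the procedure leaves that case open.

```python
def determine_orientation(word, seed_list, synonyms, antonyms):
    """Return the orientation of `word`, adding it to `seed_list` when it is inferred.

    seed_list: dict mapping word -> orientation score.
    synonyms, antonyms: dicts mapping word -> list of related words (assumed inputs).
    """
    if word in seed_list:
        return seed_list[word]
    for synonym in synonyms.get(word, []):
        if synonym in seed_list:
            seed_list[word] = seed_list[synonym]   # same orientation as the synonym
            return seed_list[word]
    for antonym in antonyms.get(word, []):
        if antonym in seed_list:
            seed_list[word] = -seed_list[antonym]  # opposite orientation
            return seed_list[word]
    return None  # not resolvable from the seed list


seeds = {"hate": -1.0, "enjoy": 0.7}
print(determine_orientation("loathe", seeds, {"loathe": ["hate"]}, {}))  # -1.0
print(determine_orientation("love", seeds, {}, {"love": ["hate"]}))      # 1.0
```

Because inferred words are written back into the seed list, the list grows as more tweets
are processed and later lookups of the same word take the first branch.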

3.4 TWITTER SENTIMENT SCORING MODULE

This is the heart of the proposed system. In this module, formulas are proposed for
calculating the sentiment of the overall tweet.

S(T) = P(T) * I(T)

where S(T) = sentiment of the tweet,

P(T) = polarity of the tweet,

I(T) = influence of the tweet.

3.4.1 Polarity of Tweet:

\[
P(T) = \frac{1}{L(R)} \left( \sum_{i} P(AF_i) + \sum_{i} P(VF_i) + \sum_{i} Y_{e_i} \cdot P(E_i) \right)
\]

where $P(AF_i)$ denotes the score of the ith adjective group,

$P(VF_i)$ denotes the score of the ith verb group,

$Y_{e_i}$ denotes the count of the ith emoticon,

$P(E_i)$ denotes the score of the ith emoticon,

$L(R)$ denotes the size of the set of opinion groups and emoticons extracted from the
tweet.
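
A minimal sketch of the polarity computation follows. It rests on two assumptions that
should be checked against the intended formula: the adjective-group, verb-group and
emoticon terms are summed and the total is divided by L(R), and L(R) is taken as the
number of opinion groups plus the number of distinct emoticons. The function and parameter
names are hypothetical.

```python
def tweet_polarity(adjective_scores, verb_scores, emoticon_terms):
    """Polarity P(T) of one tweet under the assumptions stated in the text.

    adjective_scores: list of adjective group scores P(AFi).
    verb_scores: list of verb group scores P(VFi).
    emoticon_terms: list of (count Yei, score P(Ei)) pairs, one per distinct emoticon.
    """
    # Assumed L(R): number of opinion groups plus number of distinct emoticons.
    size = len(adjective_scores) + len(verb_scores) + len(emoticon_terms)
    if size == 0:
        return 0.0  # nothing extracted, so no polarity evidence
    total = (
        sum(adjective_scores)
        + sum(verb_scores)
        + sum(count * score for count, score in emoticon_terms)
    )
    return total / size
```

With one adjective group scored -0.5, one verb group scored -1.0 and a sad emoticon
(-0.5) appearing twice, the sum is -2.5 over three items, giving a polarity of about
-0.83.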

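3.4.2 Influence of Tweet I(T)

[Equation: I(T) as a function of Tc, log(YL) and log(YX)]

where Tc denotes the fraction of the tweet in CAPS,

YL denotes the count of repeated letters, which enters the formula as log(YL),

YX denotes the count of exclamation marks, which enters the formula as log(YX).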

Where there is only one verb/adverb or adjective, 0.5 is taken as the second value.
If a tweet has a value less than -1, it is taken as -1.
If a tweet has a value greater than +1, it is taken as +1.
Tweets with values from -1.0 to -0.1 are considered negative tweets.
Tweets with values from 0.1 to 1.0 are considered positive tweets.
Tweets with values between -0.1 and +0.1 are considered neutral tweets.
If the influence of a tweet is zero, the overall sentiment of the tweet is the same as the
polarity of the tweet.
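
These rules can be sketched as follows. The function names are hypothetical, and since the
ranges above meet at -0.1 and +0.1, this sketch assumes that a score of exactly -0.1 or
+0.1 counts as neutral.

```python
def clamp(score):
    """Limit a sentiment score to the range [-1, +1]."""
    return max(-1.0, min(1.0, score))


def tweet_sentiment(polarity, influence):
    """S(T) = P(T) * I(T), falling back to the polarity when the influence is zero."""
    raw = polarity if influence == 0 else polarity * influence
    return clamp(raw)


def classify(score):
    """Map a sentiment score to a negative, positive or neutral label."""
    score = clamp(score)
    if score < -0.1:
        return "negative"
    if score > 0.1:
        return "positive"
    return "neutral"  # assumption: the boundary values themselves are neutral


print(classify(tweet_sentiment(-0.8, 2.0)))  # negative
```

In the example the raw product is -1.6, which is limited to -1 before it is classified.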

3.5 SUMMARY OF THE CHAPTER

In this chapter a new system for Twitter sentiment analysis is proposed, which has four
modules: the retrieval module, the pre-processing module, the verb/adverb scoring module
and the Twitter sentiment scoring module. In the retrieval module, tweets are extracted
from Twitter. In the pre-processing module, tweets are processed (removal of URLs, form
convertor). In the verb/adverb scoring module, the sentiment of each verb, adverb and
adjective is calculated. In the last module, the Twitter sentiment scoring module, the
overall sentiment of the Twitter data is calculated. This module has two sub-parts: the
polarity of the tweet and the influence of the tweet.


Chapter 4

IMPLEMENTATION & RESULT

4.1 RETRIEVAL MODULE (TWITTER API)

4.1.1 Configuring Flume Agent:

The Flume configuration file looks like this:

[Image: Flume configuration file]

The Flume Agent is configured so that the Twitter data is streamed into HDFS when the
Twitter Agent is started. The Twitter API key and secret are shown as ******* in the image
above, as they should not be shared. The keyword is given as "Google chrome", so that all
tweets containing that keyword are streamed. Any number of keywords can be added if
necessary. The streamed data is stored in HDFS in the directory given by the path in the
configuration above. Year, month, day and hour directories are created automatically
within HDFS as the data is streamed into it. The directories are organized in this way to
avoid confusion and improve performance.
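
The year/month/day/hour directory layout can be illustrated with a short sketch. In Flume
itself this is typically expressed through time escape sequences in the HDFS sink path
rather than in code; the function name and the base directory below are hypothetical and
serve only to show the resulting layout.

```python
from datetime import datetime


def partition_path(base_dir, timestamp):
    """Return base_dir/YYYY/MM/DD/HH for the hour in which a tweet arrived."""
    return f"{base_dir.rstrip('/')}/{timestamp:%Y/%m/%d/%H}"


# Hypothetical base directory, for illustration only.
print(partition_path("/user/flume/tweets", datetime(2016, 11, 5, 14, 30)))
# /user/flume/tweets/2016/11/05/14
```

Partitioning by hour keeps each directory small and lets later queries read only the time
range they need.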


4.1.2 Starting the Twitter Agent:

Flume starts streaming the real-time Twitter data once the Twitter Agent has been started.

The Twitter Agent is started by the command below.

4.1.3 Data Storage

The streamed Twitter data is stored in HDFS. This data is never modified or altered. As
the Twitter data is in JSON format, it does not work with the default Hive setup. A Hive
SerDe is used to interpret the data, which is one of the biggest advantages of Hive. For
example, the SerDe takes the following JSON format tweet.
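
What interpreting a JSON tweet involves can be shown with a small Python sketch. It is
only an analogy for the work the SerDe does inside Hive, not Hive code. The tweet below is
a hypothetical record trimmed to a few fields, and the function name is an assumption.

```python
import json

# Hypothetical tweet record, trimmed to a few fields for illustration.
RAW_TWEET = (
    '{"id": 1, "created_at": "Sat Nov 05 14:30:00 +0000 2016", '
    '"text": "Google chrome keeps crashing :(", '
    '"user": {"screen_name": "example_user"}}'
)


def parse_tweet(line):
    """Turn one JSON line into a flat record with the fields needed for scoring."""
    tweet = json.loads(line)
    return {
        "id": tweet["id"],
        "created_at": tweet["created_at"],
        "text": tweet["text"],
        "screen_name": tweet["user"]["screen_name"],  # nested field flattened
    }


print(parse_tweet(RAW_TWEET)["text"])  # Google chrome keeps crashing :(
```

The nested user object is flattened into a single column, which is the kind of mapping
needed before the tweet text can be queried as a table.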

HDFS Storage (Twitter Data Snapshot)


4.2 PREPROCESSING MODULE IMPLEMENTATION & OUTPUT

Let us say that the Twitter API returned some sample tweets, one of which is:

James, you hate Studying, damn BOOOORING!!! You are totally unprepared for your exam
tomorrow :(:( Things can't be good #exams

The pre-processing of this tweet gives the following outputs:

Output1: Capital Tweet Count


Description of Output:
In this tweet the total number of words is 18 (not counting the emoticons and the
hashtag), of which only one word, BOOOORING, is in capital letters, so the number of
capital words is 1.
The fraction of the tweet which is in caps (Tc) is therefore 1/18 = 0.0555556.

Output2: Exclamation Mark Count

Description of Output:

The tweet contains three exclamation marks, so YX = 3.


4.3 SCORING MODULE IMPLEMENTATION & OUTPUT

Output3: Scoring of Each Adverb/Verb

Description of Output:

After the form convertor is applied, the tweet is tokenized into groups of adverbs and
adjectives, and their sentiment is looked up using the WordNet dictionary (lexicon
dictionary). The groups with their individual word sentiments are shown in the screenshot.


4.4 TWITTER SENTIMENT SCORING MODULE OUTPUT


Output5: Tweet Sentiment Scoring

Output6: Tweet Sentiment Scoring (Continued)



4.5 USING EXISTING LEXICON ANALYSIS:


4.6 COMPARISON OF RESULTS:

Fig 5: Comparison of Existing System and Proposed System (chart of the sentiment scores of
Tweet1 and Tweet2 under the existing system and the proposed system; vertical axis from -2
to 1)

Existing System                            Proposed System

The sentiment score is very high or low    The sentiment score is in accordance
irrespective of the sentiment of the       with the sentiment of the tweet.
tweet.

Not able to give the correct result in     Able to give a result according to the
the case of a sarcastic tweet.             user's sentiment even if the tweet is
                                           sarcastic.

4.7 SUMMARY OF CHAPTER


In this chapter, we took two example tweets and calculated the sentiment of both with the
proposed system and with the existing system.

The results of both systems were then compared to show that the proposed system can give
better results than the existing system.


Chapter 5

CONCLUSION

5.1 CONCLUSION

In this thesis work, different techniques for sentiment analysis of Twitter were studied
first.

After that, an improved lexicon-based sentiment analysis for Twitter was proposed. As
discussed earlier, the existing lexicon-based sentiment analysis technique is not able to
provide accurate results. Therefore some new formulas and an improved lexicon-based
sentiment analysis were proposed, which can provide results in accordance with the user's
sentiment.

The results of the existing system and the proposed system were compared to show that the
proposed system is better than the existing system and is also able to handle sarcastic
tweets.


Chapter 6

FUTURE WORK

6.1 FUTURE SCOPE

In the proposed system, one tweet at a time is taken to analyse the Twitter data. In
future, the whole process can be automated so that it is able to handle a large data set
at a time and provide the tweet sentiment for that data. In future we can also try to
provide the overall sentiment of a large collection of data; for movie reviews, for
example, we may be able to provide the views of 1000 users at a time, which will be very
useful in opinion mining.


Chapter 7

REFERENCES

1. Brogan (2010); Zarella (2010); Kaplan and Haenlein (2010), who describe social media as
a group of internet-based applications that build on the ideological and technological
foundations of Web 2.0, and that allow the creation and exchange of user-generated
content.

2. Kaplan and Haenlein (2010).

3. Algesheimer et al. (2005); Casalo et al. (2011); Pai and Tsai (2011): sales and
promotions discussions, and other consumer discussions on the web.

4. Ayda Darban, Wei Li. The impact of online social networks on consumers' purchasing
decision: The study of food retailers. June 2012.

5. Culnan et al. How Large U.S. Companies Can Use Twitter and Other Social Media to Gain
Business Value. 2010.

6. Rudy Prabowo, Mike Thelwall. Sentiment Analysis: A Combined Approach. School of
Computing and Information Technology.

7. KungHsin Shao. The Effects of Controversial Reviews on Product Sales Performance: The
Mediating Role of the Volume of Word of Mouth. College of Management, National Taiwan
University, Taipei, Taiwan.

8. Dellarocas et al. (2007): unfavorable online word-of-mouth from customers, and online
reviews, as the reason for the flop.

9. www.statista.com


10. http://www.scholarpedia.org/article/Text_categorization

11. https://en.wikipedia.org/wiki/Information-based_complexity

12. Yun Wan. An Ensemble Sentiment Classification System of Twitter Data for Airline
Services Analysis.

13. Bo Pang, Lillian Lee. Opinion mining and sentiment analysis.

14. Jiawei Han (University of Illinois at Urbana-Champaign), Micheline Kamber, Jian Pei
(Simon Fraser University). Data Mining: Concepts and Techniques, Third Edition.

15. P. Melville, W. Gryc, R. D. Lawrence. Sentiment analysis of blogs by combining lexical
knowledge with text classification.

16. Xia, Zong and Li. Review of Sentiment Classification Methods and Opinion Mining: The
Future Roadmap. 2011.

17. V. Vapnik, A. Lerner. Pattern recognition using generalized portrait method.
Automation and Remote Control, 24, 774-780, 1963.

18. Songbo Tan, Jin Zhang. An empirical study of sentiment analysis for Chinese documents.

19. Mikhail Bautin, Lohit Vijayarenu. International Sentiment Analysis for News and Blogs.
2005.

20. Songbo Tan. A comparative study on feature selection in text categorization (Tan and
Zhang 2008).

21. Alexander Pak, Patrick Paroubek. Twitter as a Corpus for Sentiment Analysis and
Opinion Mining.
