Project Report
On
Bachelor of Engineering
By
Gadade Dnyaneshwar
(111CP1193)
Khavase Namrata
(111CP1727)
Rao Vidya
(111CP1726)
Project Guide:
CERTIFICATE
This is to certify that the project entitled Analysis & Review using Twitter Text
(111CP1727)
(111CP1726)
Engineering.
Examiners
1.-------------------------------------------2.---------------------------------------------
Date:
Place:
Declaration
We declare that this written submission represents our ideas in our own words and where
others' ideas or words have been included, we have adequately cited and referenced the
original sources. We also declare that we have adhered to all principles of academic honesty
and integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source
in our submission. We understand that any violation of the above will be cause for disciplinary
action by the Institute and can also evoke penal action from the sources which have thus not
been properly cited or from whom proper permission has not been taken when needed.
----------------------------------------(Signature)
-----------------------------------------
----------------------------------------(Signature)
-----------------------------------------
----------------------------------------(Signature)
-----------------------------------------
Chapter 1. INTRODUCTION
Recently, the number of online shopping customers has increased dramatically due to the rapid
growth of e-commerce and the increase in online merchants. To enhance customer
satisfaction, merchants and product manufacturers allow customers to review or express their
opinions on products or services. Customers can now post product reviews at
merchant sites, e.g., amazon.com, cnet.com, and epinions.com. These online customer
reviews thereafter become a source of information that is very useful for both
potential customers and product manufacturers. Customers use this
information to support their decision on whether to purchase a product. From the product
manufacturer's perspective, understanding the preferences of customers is highly valuable for
product development, marketing and consumer relationship management.
Since customer feedback influences other customers' decisions, review documents have
become an important source of information for business organizations to take into account
while developing marketing and product development plans.
Of the two main types of textual information, facts and opinions, most current
information processing methods such as web search and text mining work with the former.
Opinion Mining refers to the broad area of natural language processing, computational
linguistics and text mining involving the computational study of opinions, sentiments and
emotions expressed in text. A thought, view, or attitude based on emotion instead of reason is
often referred to as a sentiment. Hence, an alternate term for Opinion Mining is
Sentiment Analysis. This field finds critical use in areas where organizations or individuals
wish to know the general sentiment associated with a particular entity, be it a product, person,
public policy, movie or even an institution. Opinion mining has many application domains
including science and technology, entertainment, education, politics, marketing, accounting,
law, research and development. In earlier days, with limited access to user-generated opinions,
research in this field was minimal. But with the tremendous growth of the World Wide Web,
huge volumes of opinionated text in the form of blogs, reviews, discussion groups and
forums are available for analysis, making the World Wide Web the fastest, most
comprehensive and most easily accessible medium for sentiment analysis. However, finding
opinion sources and monitoring them over the Web can be a formidable task because a large
number of diverse sources exist on the Web and each source also contains a huge volume of
information. From a human's perspective, it is both difficult and tiresome to find relevant
sources, extract pertinent sentences, read them, summarize them and organize them into a
usable form. An automated and faster opinion mining and summarizing system is thus needed.
2.2 Motivation
In recent years, we have witnessed that opinionated postings in social media have helped
reshape businesses and sway public sentiments and emotions, which has profoundly
impacted our social and political systems. Such postings have also mobilized masses for
political changes, such as those that happened in some Arab countries in 2011. It has thus become a
necessity to collect and study opinions on the Web. Of course, opinionated documents do not
exist only on the Web (called external data); many organizations also have internal data,
e.g., customer feedback collected from emails and call centers or results from surveys
conducted by the organizations.
Due to these applications, industrial activities have flourished in recent years. Sentiment
analysis applications have spread to almost every possible domain, from consumer products,
services, healthcare, and financial services to social events and political elections. Many big
corporations have also built their own in-house capabilities, e.g., Microsoft, Google,
Hewlett-Packard, SAP, and SAS. These practical applications and industrial interests have provided
strong motivations for research in sentiment analysis.
The proposed system deals with the English language. English was selected because it is
widely used and universally known, and most customers express their opinions in English.
The problem is that the content available on the web, that is, the recommendations and
views expressed by internet users, is unstructured or semi-structured. In simple terms, the
English used can be grammatically incorrect, with spelling mistakes and wide
use of shortcuts. Technically, we can distinguish formal opinions from
informal opinions: formal opinions use proper English with no
grammatical mistakes, while informal opinions use improper English with grammatical
mistakes and words with non-standard spellings. UK English is often
said to be formal English, and US English is considered to be informal English.
Most customers use US English, so it is very important to concentrate on these
opinions too, or else they might be discarded as noise. Hence the proposed system deals with
this aspect of informal reviews.
Our work is partly based on and closely related to opinion mining and sentence sentiment
classification. Extensive research has been done on sentiment analysis of review text and
subjectivity analysis (determining whether a sentence is subjective or objective). Another
related area is feature/topic-based sentiment analysis, in which opinions on particular
attributes of a product are determined. Most of this work concentrates on finding the
sentiment associated with a sentence (and in some cases, the entire review). There has also
been some research on automatically extracting product features from review text. Though
there has been some work in review summarization, and assigning summary scores to
products based on customer reviews, there has been relatively little work on ranking products
using customer reviews.
At the document level, sentiment classification of documents into positive, negative, and
neutral polarities is done under the assumption that each document focuses on a single
object O (although this is not necessarily the case in many realistic situations such as
discussion forum posts) and contains opinions from a single opinion holder. At the sentence
At the feature level, the various tasks that are looked at are:
Task 1: Identifying and extracting object features that have been commented on in each
review/text.
Task 2: Determining whether the opinions on the features are positive, negative or
neutral.
Task 3: Grouping synonyms of features.
When both F (the set of features) and W (the synonyms of each feature) are unknown, all three
tasks need to be performed. If F is known but W is unknown, all three tasks are needed, but
Task 3 is easier: it narrows down to the problem of matching discovered features with the set
of given features F. When both W and F are known, only Task 2 is needed.
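The matching step can be illustrated with a small lookup. This is only a sketch under stated assumptions: the feature set F, the synonym map W and the helper match_feature are hypothetical examples, not data or code from the proposed system.

```python
# Hypothetical feature set F and synonym sets W for a phone-review domain.
F = {"screen", "battery", "sound"}
W = {
    "screen": {"display", "lcd"},
    "battery": {"battery life"},
    "sound": {"audio", "speaker"},
}

# Invert W so that each synonym points back to its feature in F.
SYNONYM_TO_FEATURE = {syn: feat for feat, syns in W.items() for syn in syns}


def match_feature(term):
    """Map a discovered term to a known feature in F, or None if unknown."""
    key = term.strip().lower()
    if key in F:
        return key
    return SYNONYM_TO_FEATURE.get(key)


print(match_feature("Display"))  # a synonym, resolved through W
print(match_feature("camera"))   # not in F or W
```

When W is unknown, the dictionary lookup would have to be replaced by some similarity measure between the discovered term and each member of F.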
3.3.1.1 Approach
1. Classification based on sentiment phrases
2. Classification using text classification methods
3. Classification using a score function
Disadvantages
Specific features of an object that the author likes or dislikes cannot be identified
Main focus may not be evaluation or review, but still contain a few opinion sentences
The input of this method is a known subjective vocabulary and a collection of annotated texts.
The high-precision (HP) classifiers determine whether sentences are objective or subjective based on
the input vocabulary. High-precision means their behavior is stable and reproducible. They
are not able to classify all the sentences, but they make almost no errors.
Then the phrase patterns that are supposed to represent a subjective sentence are
extracted and applied to the sentences the HP classifiers have left unlabeled.
The system is self-improving, as the new subjective sentences or patterns are used in a loop
on the unlabeled data.
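The idea of a classifier that abstains rather than guesses can be sketched as follows. The clue list, the threshold of two clues and the function name hp_classify are assumptions made for illustration; the text above does not specify the actual decision rule.

```python
# Hypothetical subjective vocabulary; a real system would load a larger lexicon.
SUBJECTIVE_CLUES = {"love", "hate", "terrible", "wonderful", "awful", "amazing"}


def hp_classify(sentence, min_clues=2):
    """Label a sentence only when the evidence is strong; otherwise abstain.

    Returns "subjective", "objective", or None (left unlabeled).
    The min_clues threshold is an illustrative assumption.
    """
    tokens = [t.strip(".,!?;:").lower() for t in sentence.split()]
    hits = sum(1 for t in tokens if t in SUBJECTIVE_CLUES)
    if hits >= min_clues:
        return "subjective"
    if hits == 0:
        return "objective"
    return None  # ambiguous: leave for the pattern-based pass


for s in ["I love this wonderful phone!",
          "The phone weighs 120 grams.",
          "The terrible weather delayed shipping."]:
    print(s, "->", hp_classify(s))
```

Sentences that come back as None are the ones a pattern-learning pass would then try to label.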
This algorithm was able to recognize 40% of the subjective sentences in a test set of 2197
sentences (59% of which are subjective) with 90% precision. For comparison, the HP subjective
classifier alone recognizes 33% of the subjective sentences with 91% precision. Alongside this
original method, more classical data mining algorithms are used, such as the Naive Bayes
classifier.
The general concept is to split each sentence into features, such as presence of words, presence
of n-grams, and heuristics from other studies in the field, and to use the statistics of the
training data set about those features to classify new sentences. Their results show that the
more features, the better. They achieved at best 80-90% recall and precision for
subjective/opinion sentences and 50% recall and precision for objective/fact
sentences. Sentence-level sentiment classification methods are improving; these results
from research studies in 2003 show that they were already quite effective then and that the task
is feasible.
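A Naive Bayes classifier over word-presence features can be sketched in a few lines. The training sentences, labels and function names are hypothetical, and add-one (Laplace) smoothing is an assumption; the studies summarized above used far richer feature sets.

```python
import math
from collections import Counter, defaultdict


def tokenize(sentence):
    """Return the set of lowercased words present in the sentence."""
    return {t.strip(".,!?;:").lower() for t in sentence.split()} - {""}


def train(examples):
    """examples: list of (sentence, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sentence, label in examples:
        label_counts[label] += 1
        for word in tokenize(sentence):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(model, sentence):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(sentence):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
model = train([
    ("I love this phone", "subjective"),
    ("The camera is awful", "subjective"),
    ("The phone has a 5 inch screen", "objective"),
    ("It ships with a charger", "objective"),
])
print(classify(model, "I love the camera"))
print(classify(model, "It ships with a 5 inch screen"))
```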
Format 1 - Pros, Cons and detailed review: The reviewer is asked to describe Pros and
Cons separately and also write a detailed review. Epinions.com uses this format.
Format 2 - Pros and Cons: The reviewer is asked to describe Pros and Cons separately.
C|net.com used to use this format.
Format 3 - free format: The reviewer can write freely, i.e., no separation of Pros and
Cons. Amazon.com uses this format.
The proposed system aims to identify the semantic orientations of movie reviews. This task can
be conducted in eight main steps (shown in Figure 1):
Step 1. Parse the movie reviews with the Stanford parser.
Step 2. Extract adjectives and features from the reviews together with their
grammatical relationships, i.e., perform a grammatical analysis of the reviews. In this step, all adjectives
and the features modified or described by them are found. This step relies on
grammatical knowledge.
Step 3. Predict the polarity of adjectives. Based on WordNet, all adjectives are
divided into five groups. The output of this step is the polarity knowledge library of adjectives.
Step 4. Compute the similarity between features and the topic.
Step 5. Based on the polarity library and the similarity between features and the topic, produce the
original score of every noun.
Step 6. Produce the weighted score.
Step 7. To predict the semantic orientation of a review, the review must be represented as a
vector. This step constructs the review vector from the weighted scores.
Step 8. Produce polarity labels for the reviews. We use two methods, machine
learning and total weighted score computing, to generate polarity labels for the reviews
and compare their precision with that of a simple term counting method.
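The difference between total weighted score computing and simple term counting can be illustrated with a toy example. The polarity values, similarity values and the product used as the score are all assumptions made for illustration; the report does not give the actual scoring formulas.

```python
# Hypothetical polarity library (Step 3) and feature-topic similarity (Step 4)
# for the topic "movie". All values are invented for illustration.
POLARITY = {"great": 1.0, "good": 0.5, "dull": -0.5, "awful": -1.0}
SIMILARITY = {"plot": 0.9, "acting": 0.8, "popcorn": 0.1}


def total_weighted_score(pairs):
    """Sum of polarity * similarity over (adjective, feature) pairs.

    Using the product as the weighted score is an assumption.
    """
    return sum(POLARITY.get(adj, 0.0) * SIMILARITY.get(feat, 0.0)
               for adj, feat in pairs)


def term_count_score(pairs):
    """Baseline: +1 per positive adjective, -1 per negative, features ignored."""
    return sum((POLARITY.get(adj, 0.0) > 0) - (POLARITY.get(adj, 0.0) < 0)
               for adj, _ in pairs)


def label(score):
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# A review that praises the popcorn but pans the plot.
pairs = [("awful", "plot"), ("great", "popcorn"), ("good", "popcorn")]
print(label(total_weighted_score(pairs)))  # weighted by topic relevance
print(label(term_count_score(pairs)))      # plain counting
```

In this toy review the two methods disagree, because term counting gives the off-topic praise the same weight as the on-topic criticism.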
[Figure: system flow diagram. Stages: Admin Login, Input Keyword, Tweets Retrieval, Data Preprocessing, Classification using NLP, Feature Set Extraction, QT clustering]
4.2 USE CASE DIAGRAM
[Figure: use case diagram. Actor: Admin. Use cases: Retrieve Tweets, Feature Extraction of words, QT clustering, Result Generation]
The purpose of the analysis is to extract, organize, and classify the information contained in
the required documents. Each document is first cleaned of unwanted stop-words
and words not listed in the dictionary. It is then classified as formal or informal depending on
its use of words, after which POS tags are applied. The feature and its
accompanying opinion are then identified and extracted. The output of the proposed system is
feature-opinion pairs from the review documents.
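The cleaning and formal/informal classification steps can be sketched as follows. The stop-word list, the dictionary, and the rule that any out-of-dictionary word marks a document as informal are assumptions for illustration only; the names are hypothetical.

```python
# Hypothetical resources; a real system would load full word lists.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to"}
DICTIONARY = {"screen", "battery", "great", "phone", "camera", "poor", "very"}


def preprocess(text):
    """Return (kept_tokens, style) for one review document.

    Assumption: a document is "informal" if it contains any non-stop-word
    that is missing from the dictionary, and "formal" otherwise.
    """
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    content = [t for t in tokens if t and t not in STOP_WORDS]
    unknown = [t for t in content if t not in DICTIONARY]
    kept = [t for t in content if t in DICTIONARY]
    style = "informal" if unknown else "formal"
    return kept, style


print(preprocess("The screen is great."))
print(preprocess("Battery is gr8 plz"))
```

The kept tokens would then be passed on for POS tagging.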
The complete architecture of the proposed opinion mining system consists of five
different functional components:
document pre-processor,
document parser,
degree of expressiveness of the opinion. As pointed out in, opinion words and features are
generally associated with each other and consequently, there exist inherent as well as semantic
relations between them. Therefore, the feature and opinion learner module is implemented as
a rule-based system, which analyzes the dependency relations to identify information
components from review documents. For example, consider the following opinion sentences
related to Nokia N95:
(i)
(ii)
(iii)
(iv)
In example (i), the screen is a noun phrase which represents a feature of Nokia N95, and the
adjective word attractive can be extracted using nominal subject nsubj relation (a dependency
relationship type used by Stanford parser) as an opinion. Further, using advmod relation the
adverb very can be identified as a modifier to represent the degree of expressiveness of the
opinion word attractive. In example (ii), the noun sound is a nominal subject of the verb
comes, and the adjective word clear is adjectival complement of it. Therefore, clear can be
extracted as opinion word for the feature sound. In example (iii), the adjective pretty is parsed
as directly depending on the noun screen through amod relationship. If pretty is identified as
an opinion word, then the word screen can be extracted as a feature; likewise, if screen is
identified as a feature, the adjective word pretty can be extracted as an opinion. Similarly in
example (iv), the noun email is a nominal subject of the verb is, and the word Best is direct
object of it. Therefore, Best can be identified as opinion word for the feature word email.
Based on these and other observations, we have defined different rules to tackle different
types of sentence structures to identify information components embedded within them.
such that POS(w1) = POS(w2) = NN*, POS(w3) = JJ* and w1, w2 are not stop-words, or if
there exists a relationship nsubj(w3; w4) such that POS(w3) = JJ*, POS(w4) = NN* and w3,
w4 are not stop-words, then either (w1; w2) or w4 is considered as the feature and w3 as an
opinion. Thereafter, the relationship advmod(w3; w5) relating w3 with some adverbial word
w5 is searched. If the advmod relationship is present, the information component is
identified as <(w1; w2) or w4; w5; w3>, otherwise as <(w1; w2) or w4; -; w3>.
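The nsubj/advmod part of the rule, together with the amod case from example (iii), can be sketched over pre-parsed input. The triple format (relation, governor, dependent), the function name and the sample sentence are assumptions; a real system would obtain the relations and POS tags from the Stanford parser. The compound-noun case (w1; w2) is not covered here.

```python
def extract_components(dependencies, pos, stop_words=frozenset()):
    """Extract <feature, modifier, opinion> triples from dependency relations.

    dependencies: list of (relation, governor, dependent) triples.
    pos: dict mapping each word to its POS tag.
    """
    components = []
    for rel, gov, dep in dependencies:
        if (rel == "nsubj" and pos.get(gov, "").startswith("JJ")
                and pos.get(dep, "").startswith("NN")):
            opinion, feature = gov, dep      # nsubj(attractive, screen)
        elif (rel == "amod" and pos.get(gov, "").startswith("NN")
                and pos.get(dep, "").startswith("JJ")):
            feature, opinion = gov, dep      # amod(screen, pretty)
        else:
            continue
        if feature in stop_words or opinion in stop_words:
            continue
        # Look for an adverb that modifies the opinion word.
        modifier = next((d for r, g, d in dependencies
                         if r == "advmod" and g == opinion), None)
        components.append((feature, modifier, opinion))
    return components


# Hypothetical parse of "The screen is very attractive".
deps = [("det", "screen", "The"), ("nsubj", "attractive", "screen"),
        ("cop", "attractive", "is"), ("advmod", "attractive", "very")]
tags = {"The": "DT", "screen": "NN", "is": "VBZ",
        "very": "RB", "attractive": "JJ"}
print(extract_components(deps, tags))
```

A modifier of None corresponds to the "-" slot in the information component.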
The need to identify and interpret possible differences in the linguistic style of texts, such as
formal or informal, is increasing as more people use the Internet as their main research
resource. Different factors affect the style, including the words and expressions
used and syntactical features. Vocabulary choice is likely the biggest style marker. In general,
longer words and verbs of Latin origin are formal, while phrasal verbs and idioms are informal
(Park, 2007). There are also many formal/informal style equivalents that can be used in
writing.
The formal style is used in most writing and business situations, and when speaking to people
with whom we do not have close relationships. Some characteristics of this style are long
words and use of the passive voice. Informal style is mainly for casual conversation, like at
home between family members, and is used in writing only when there is a personal or close
relationship, such as that of friends and family. Some characteristics of this style are word
contractions such as won't, abbreviations like phone, and short words. We discuss the
main characteristics of both styles.
Feature                        Informal Style       Formal Style
Contractions                   Use contractions     Avoid contractions
Phrasal Verbs                  e.g., I researched information about nursing positions.
Personal/Impersonal Pronouns
Informal     Formal
about        approximately
and          in addition
anybody      anyone
ask for      request
boss         employer
but          however
buy          purchase
end          finish
enough       sufficient
get          obtain
go up        increase
have to      must
Contractions List
Informal     Formal
aren't       are not
can't        cannot
didn't       did not
hadn't       had not
hasn't       has not
I'm          I am
Abbreviations List
Informal     Formal
asap         as soon as possible
grad.        graduate
HR           Human Resources
Feb.         February
Lab          laboratory
temp         temperature
Message
Love
Welcome
Please
When
Would
By the way
Because
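Lists like these can be applied as a simple lookup to rewrite informal tokens in their formal form before parsing. The sketch below uses a few entries from the contraction and abbreviation lists above; the function name and the token-by-token replacement strategy are assumptions, not the report's implementation.

```python
# Entries taken from the contraction and abbreviation lists above (lowercased keys).
INFORMAL_TO_FORMAL = {
    "aren't": "are not", "can't": "cannot", "didn't": "did not",
    "hadn't": "had not", "hasn't": "has not", "i'm": "I am",
    "asap": "as soon as possible", "lab": "laboratory", "temp": "temperature",
}


def normalize(text):
    """Replace informal tokens with formal equivalents, keeping punctuation."""
    out = []
    for tok in text.split():
        key = tok.lower()
        if key in INFORMAL_TO_FORMAL:
            out.append(INFORMAL_TO_FORMAL[key])
            continue
        core = key.rstrip(".,!?;:")   # retry without trailing punctuation
        tail = tok[len(core):]
        out.append(INFORMAL_TO_FORMAL[core] + tail
                   if core in INFORMAL_TO_FORMAL else tok)
    return " ".join(out)


print(normalize("I'm sure the temp hasn't changed."))
```

Word-level substitutions such as "about" to "approximately" are deliberately left out, since they depend on context.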
Chapter 6. Implementation
6.1 Hardware Resources Required:
Hardware Requirements
1. RAM: 1GB or more
Processor
OS
Dataset
Review Website
and JavaFX
With its constantly improving Java Editor, many rich features and an extensive range of tools,
templates and samples, NetBeans IDE sets the standard for developing with cutting edge
technologies out of the box.
three parsers.
POS Tag  Description                             Example
CC       coordinating conjunction                and
CD       cardinal number                         1, third
DT       determiner                              the
EX       existential there                       there is
FW       foreign word                            d'hoevre
IN       preposition/subordinating conjunction   in, of, like
JJ       adjective                               big
JJR      adjective, comparative                  bigger
JJS      adjective, superlative                  biggest
LS       list marker                             1)
MD       modal                                   could, will
NN       noun, singular or mass                  door
NNS      noun plural                             doors
NNP      proper noun, singular                   John
NNPS     proper noun, plural                     Vikings
PDT      predeterminer                           both the boys
POS      possessive ending                       friend's
PRP      personal pronoun                        I, he, it
PRP$     possessive pronoun                      my, his
RB       adverb                                  however, usually, naturally, here, good
RBR      adverb, comparative                     better
RBS      adverb, superlative                     best
RP       particle                                give up
TO       to                                      to go, to him
UH       interjection                            uhhuhhuhh
VB       verb, base form                         take
VBD      verb, past tense                        took
VBG      verb, gerund/present participle         taking
VBN      verb, past participle                   taken
VBP      verb, sing. present, non-3d             take
VBZ      verb, 3rd person sing. present          takes
WDT      wh-determiner                           which
WP       wh-pronoun                              who, what
WP$      possessive wh-pronoun                   whose
WRB      wh-adverb                               where, when
Chapter 6. CONCLUSION
Emotion and mood often affect a user's behavior and his/her interaction with other users.
Positivity and negativity are two important attributes of emotion and mood. Based on our
analysis, we explored several key differences between positive and negative users of online
social networks. We introduced a methodology for tweet partitioning and user partitioning,
which led to seven groups of users from highly negative to highly positive. Our findings show
that negative users are not interested in sharing their negativity in social media when
compared to neutral and positive users. In addition, positive users are more likely to make
friendships with negative users based on the number of followers and followees. An
interesting result is that negative users retweet much more than positive users. They use
Twitter as a tool for social awareness and they need to gain emotional support. Retweeting
positive tweets makes them more positive. Also, negative users try to avoid any interaction by
replying to tweets, since they believe it can make them more negative. In addition, positive
users also don't interact much with negative users, which means everyone tries to avoid
interacting with negative users, perhaps because such interactions make them more negative.
Our findings can be used in developing tools for automatically identifying positive and
negative users based on user activities in Twitter. The key distinguishing criteria are how
many followers and followees a user has, the volume and type (positive or negative) of tweets
a user sends out, and the volume and type of tweets a user retweets as well as replies to. Our future
work involves building a classifier based on the findings of this paper.