Analysis & Review of Product Using Twitter Tweets As A Dataset

A
Project Report
On
Analysis & Review of Product using Twitter

Tweets as a Dataset
Submitted in partial fulfillment of the requirements
of University of Mumbai for the degree of
Bachelor of Engineering
By
Gadade Dnyaneshwar
(111CP1193)
Khavase Namrata
(111CP1727)
Rao Vidya
(111CP1726)
Project Guide:
Prof. Chandrashekhar Badgujar
Department of Computer Engineering

Mahatma Gandhi Missions College of Engineering & Technology
Kamothe, Navi Mumbai 400 209
Academic Year: 2015-16
i
CERTIFICATE
This is to certify that the project entitled Analysis & Review using Twitter Text
Mining is a bonafide work of

1. Mr. Gadade Dnyaneshwar (111CP1193)
2. Ms. Khavase Namrata
(111CP1727)
3. Ms. Rao Vidya
(111CP1726)
Submitted to the University of Mumbai in partial fulfillment of

the requirement for the award of the degree of Undergraduate in Computer
Engineering.
(Prof. Chandrasekhar Badgujar)

Project Guide
(Prof. Jagrati Shekhawat)

Project Coordinator
(Dr. Anitha Patil)

Head of Department
(Dr. Santosh K. Narayankhedkar)

Principal
ii
Project Report Approval for B. E.

This project report entitled Analysis & Review using Twitter Text Mining by
BE Students is approved for the degree of Computer Engineering
Examiners
1.-------------------------------------------2.---------------------------------------------
Date:
Place:
iii
Declaration
We declare that this written submission represents our ideas in our own words and where
others' ideas or words have been included, we have adequately cited and referenced the
original sources. We also declare that we have adhered to all principles of academic honesty
and integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source
in my submission. We understand that any violation of the above will be cause for disciplinary
action by the Institute and can also evoke penal action from the sources which have thus not
been properly cited or from whom proper permission has not been taken when needed.
----------------------------------------(Signature)
-----------------------------------------
Gadade Dnyaneshwar (111CP1193)
----------------------------------------(Signature)
-----------------------------------------
(Khavase Namrata (111CP1727)
----------------------------------------(Signature)
-----------------------------------------
Rao Vidya (111CP1726)

Date:
iv
Chapter 1. INTRODUCTION
Recently, a number of online shopping customers have dramatically increased due to the rapid
growth of e-commerce, and the increase of online merchants. To enhance the customer
satisfaction, merchants and product manufacturers allow customers to review or express their
opinions on the products or services. The customers can now post a review of products at
merchant sites, e.g., amazon.com, cnet.com, and epinions.com. These online customer
reviews, thereafter, become a cognitive source of information which is very useful for both
potential customers and product manufacturers. Customers have utilized this piece of this
information to support their decision on whether to purchase the product. For product
manufacturer perspective, understanding the preferences of customers is highly valuable for
product development, marketing and consumer relationship management.
Since customer feedbacks influence other customer's decision, the review documents have
become an important source of information for business organizations to take it into account
while developing marketing and product development plans.
How does Opinion Mining System Works?
Among the 2 main types of textual information - facts and opinions, a major portion of current
information processes methods such as web search and text mining work with the former.
Opinion Mining refers to the broad area of natural language processing, computational
linguistics and text mining involving the computational study of opinions, sentiments and
emotions expressed in text. A thought, view, or attitude based on emotion instead of reason is
often referred to as a sentiment. Hence, an alternate term for Opinion Mining, namely
Sentiment Analysis. This field ends critical use in areas where organizations or individuals
wish to know the general sentiment associated to a particular entity - be it a product, person,
public policy, movie or even an institution. Opinion mining has many application domains
including science and technology, entertainment, education, politics, marketing, accounting,
law, research and development. In earlier days, with limited access to user generated opinions,
research in this field was minimal. But with the tremendous growth of the World Wide Web,
huge volumes of opinionated texts in the form of blogs, reviews, discussion groups and
forums are available for analysis making the World Wide Web the fastest, most
comprehensive and easily accessible medium for sentiment analysis. However, finding
opinion sources and monitoring them over the Web can be a formidable task because a large
number of diverse sources exist on the Web and each source also contains a huge volume of
information. From a humans perspective, it is both difficult and tiresome to find relevant
sources, extract pertinent sentences, read them, summarize them and organize them into
usable form. An automated and faster opinion mining and summarizing system is thus needed.
1.1 Scope of the Project

Opinion mining generally refers to the process of extracting product features and opinions
from review documents and summarizing them using a graphical representation. Generally,
document level opinion mining systems fail to reveal the product features liked or disliked by
the users, rather they classify the reviews as positive or negative. A positive review does not
mean that the opinion holder has positive opinion on all aspects or features of the product.
Similarly, a negative review does not mean that the opinion holder dislikes everything about
the product. Keeping in mind the above facts, feature-based opinion mining is proposed. Since
a complete opinion is always expressed in one sentence along with its relevant feature, the
feature and opinion pair extraction can be performed at sentence-level to avoid their false
associations.
Chapter 2. MOTIVATION & APPROACH

2.1 Problem Statement
As the number of reviews that a product receives may grow rapidly and many times the
reviews may also be quite lengthy, it is hard for the customers to analyze them through
manual reading to make an informed decision to purchase a product. A large number of
reviews for a single product may also make it harder for individuals to evaluate the true
underlying quality of a product. In these cases, customers may naturally gravitate to read a
few reviews in order to form a decision regarding the product and he/ she may get only a
biased view of the product. Similarly, manufacturers want to read the reviews to identify what
elements of a product affect sales most, and a large number of reviews make it hard for
product manufacturers or business organizations to keep track of customer's opinions and
sentiments on their products and services. Since, most of the reviews are stored either in
unstructured or semi-structured format; the distillation of knowledge from this huge repository
becomes a challenging task. It would be a great help for both customers and manufacturers if
the reviews could be processed automatically and presented in a summarized form
highlighting the product features and users opinions expressed over them. So I propose a text
mining approach to mine product features and opinions from review documents.
2.2 Motivation
In recent years, we have witnessed that opinionated postings in social media have helped
reshape businesses, and sway public sentiments and emotions, which have profoundly
impacted on our social and political systems. Such postings have also mobilized masses for
political changes such as those happened in some Arab countries in 2011. It has thus become a
necessity to collect and study opinions on the Web. Of course, opinionated documents not
only exist on the Web (called external data), many organizations also have their internal data,
e.g., customer feedback collected from emails and call centers or results from surveys
conducted by the organizations.
Due to these applications, industrial activities have flourished in recent years. Sentiment
analysis applications have spread to almost every possible domain, from consumer products,
services, healthcare, and financial services to social events and political elections. Many big
corporations have also built their own in-house capabilities, e.g., Microsoft, Google, HewlettPackard, SAP, and SAS. These practical applications and industrial interests have provided
strong motivations for research in sentiment analysis.
3
Now the proposed system deals with the language of English. The purpose behind selecting
the English language is because it is widely used and universally known. Most of the
customers would express their opinions using English language. But the problem is that the
contents that are available on web that is, the recommendations and views expressed by the
internet users are unstructured or semi-structured. In simple terms we can say that the use of
English language can be grammatically incorrect with chances of spelling mistakes and wide
use of shortcuts. Technically we can phrase these by calling them formal opinions and
informal opinions. Where formal opinions refer to the use of proper English with no
grammatical mistakes and informal refers to the improper use of English with grammatical
mistakes, involving use of words which are not standard spellings. They say that the UK
English is said to be formal English and US English is considered to be informal English.
Most of the customers use US English, hence it is very important to concentrate on these
opinions too, or else they might be discarded as noise. Hence the system proposed deals with
this aspect of informal reviews.
Chapter 3. Related Work

3.1 Overview
4
Our work is partly based on and closely related to opinion mining and sentence sentiment
classification. Extensive research has been done on sentiment analysis of review text and
subjectivity analysis (determining whether a sentence is subjective or objective). Another
related area is feature/topic-based sentiment analysis, in which opinions on particular
attributes of a product are determined. Most of this work concentrates on finding the
sentiment associated with a sentence (and in some cases, the entire review). There has also
been some research on automatically extracting product features from review text. Though
there has been some work in review summarization, and assigning summary scores to
products based on customer reviews, there has been relatively little work on ranking products
using customer reviews.
3.2 Existing System

Existing Systems on feature-based opinion mining have applied various methods for feature
extraction and refinement, including NLP and statistical methods. However, these analyses
revealed two main problems. First, most systems select the feature from a sentence by
considering only information about the term itself, for example, term frequency, not bothering
to consider the relationship between the term and the related opinion phrases in the sentence.
As a result, there is a high probability that the wrong terms will be chosen as features. Second,
words like photo, picture, and image that have the same or similar meanings are treated as
different features since most methods only employ surface or grammatical analysis for feature
differentiation. This results in the extraction of too many features from the review data, often
causing incorrect opinion analysis and providing an inappropriate summary of the review
analysis.
3.3 Level of Opinion Mining

The opinion mining tasks at hand can be broadly classified based on the level at which it is
done with the various levels being namely,
The document level,
The sentence level.
The feature level.
At the document level, sentiment classification of documents into positive, negative, and
neutral polarities is done with the assumption made that each document focuses on a single
object O(although this is not necessarily the case in many realistic situations such as
discussion forum posts) and contains opinion from a single opinion holder. At the sentence
5
level, identification of subjective or opinionated sentences amongst the corpus is done by

classifying data into objective (Lack of opinion) and subjective or opinionated text.
Subsequently, sentiment classification ofthe aforementioned sentences are done moving each
sentence into positive, negative and neutral classes. At this level as well, I make the
assumption that a sentence contains only one opinion which as in our previous levels is not
true in many cases. An optional task is to consider clauses.
At the feature level, the various tasks that are looked at are:
Task1: Identifying and extracting object features that have been commented on in each
review/text.
Task 2: Determining whether the opinions on the features are positive, negative or
neutral.
Task 3: Grouping feature synonyms and producing a feature-based opinion summary

of multiple reviews/text.
When both F (the set of features) and W (synonym of each feature) are unknown, all three
tasks need to be performed. If F is known but W is unknown, all three tasks are needed, but
Task 3 is easier. It narrows down to the problem of matching discovered features with the set
of given features F. When both W and F are known, only task 2 is needed.
3.3.1 Document Level Sentiment Analysis

The binary classification task of labeling an opinionated document as expressing either an
overall positive or an overall negative opinion is called sentiment polarity classification or
polarity classification. This binary decision task is alternatively termed as sentiment
classification. Sentiment polarity forms one of the core problems in opinion mining in the
sense that it forms the general character for a set of problems in this field: given an
opinionated piece of text, where we shall assume that the entire text bears an overall opinion
on a single issue or item, classify the opinion as one of two opposing sentiment polarities and
locate its position on the continuum between these two polarities.
3.3.1.1 Approach
1. Classification based on sentiment phrases
2. Classification using text classification methods
3. Classification using a score function
6
Example (positive Document)

Canons PowerShot SX10 IS is a 10 Megapixel super-zoom camera with a 20x opticallystabilized lens and a 2.5in flip-out screen. It has an excellent resolution. Digital zoom images
are surprisingly good. Announced in September 2008 alongside the higher-end SX1 IS, its the
successor to the best-selling PowerShotS5 IS and retains its main body shape, articulated
screen, a battery power and movies with stereo sound, but within the camera theres been
some major changes. Now, it has better noise control than Canon's previous "S" models.
However, autofocus is very slow.
1) Classify documents (e.g., reviews) based on the overall sentiments expressed by

opinion holders (authors)
Positive, negative, and (possibly) neutral

Since in our model an object O itself is also a feature, then sentiment classification
essentially determines the opinion expressed on O in each document (e.g., review).
2) Similar but different from topic-based text classification.
In topic-based text classification, topic words are important.

In sentiment classification, sentiment words are more important, e.g., great, excellent,
horrible, bad, worst, etc.
3) Mainly at the document-level, but also extendable to the sentence-level.
Disadvantages
It does not give details on what people liked or disliked
Specific features of an object that the author likes or dislikes cannot be identified
It is not easily applicable to non-reviews, e.g. forum and blog postings

7
Main focus may not be evaluation or review, but still contain a few opinion sentences
3.3.2 Sentence-level sentiment Analysis

The sentiment classification at the document-level is the most important field of web opinion
mining. However, for most applications, the document-level is too coarse. Therefore it is
possible to perform finer analysis at the sentence-level. The research studies in this field
mostly focus on a classification of the sentences whether they hold an objective or a
subjective speech, the aim is to recognize subjective sentences in news articles and not to
extract them. The sentiment classification as it has been described in the document-level part
still exists at the sentence-level; the same approaches as the Turney's algorithm are used,
based on likelihood ratios. Because this approach has already been described in this paper, this
part focuses on the objective/subjective sentences classification and presents two methods to
tackle this issue. The first method is based on a bootstrapping approach using learned patterns.
It means that this method is self-improving and is based on phrases patterns which are learned
automatically.
The input of this method is known subjective vocabulary and a collection of annotated texts.
The high-precision classifiers find whether the sentences are objective or subjective based on
the input vocabulary. High-precision means their behaviors are stable and reproducible. They
are not able to classify all the sentences but they make almost no errors.
Then the phrases patterns which are supposed to represent a subjective sentence are
8
extracted and used on the sentences the HP classifiers have let unlabeled.
The system is self-improving as the new subjective sentences or patterns are used in a loop
on the unlabeled data.
This algorithm was able to recognize 40% of the subjective sentences in a test set of 2197
sentences (59% are subjective) with a 90% precision. In order to compare, the HP subjective
classifier alone recognizes 33% of the subjective sentences with a 91%precision.Along this
original method, more classical data mining algorithm are used such as the nave bayes
classifier.
The general concept is to split each sentence in features -- such as presence of words, presence
of n-grams, and heuristics from other studies in the field -- and to use the statistics of the
training data set about those features to classify new sentences. Their results show that the
more features, the better. They achieved at best a 80-90%recall and precision classification for
subjective/opinions sentences and a 50% recall and precision classification for objective/facts
sentences. The sentence-level sentiment classification methods are improving, this results
from research studies in 2003 show that they were already quite efficient then and that the task
is possible.
3.3.3 Feature-based sentiment analysis

Sentiment classifications at both document and sentence (or clause) level are useful, however
they do not find what the opinion holder liked and disliked. A negative sentiment on an object
does not mean that the opinion holder dislikes everything about the object. Similarly, a
positive sentiment on an object does not mean that the opinion holder likes everything about
the object. Thus, sentiment analysis at the feature level is necessary.
Considering user reviews as our primary source of opinionated data in this model, we look at
different review formats that the system is expected to handle (most of which are commonly
used in websites and forums).
Format 1 - Pros, Cons and detailed review: The reviewer is asked to describe Pros and
Cons separately and also write a detailed review. Epinions.com uses this format.
Format 2 - Pros and Cons: The reviewer is asked to describe Pros and Cons separately.
C|net.com used to use this format.
Format 3 - free format: The reviewer can write freely, i.e., no separation of Pros and
Cons. Amazon.com uses this format.
3.4 System structure
The proposed system aims to identify semantic orientations of movie reviews. This task can
be conducted in eight main steps (Shown as Figure 1):
Step1. The first step is to parse the movie reviews by Stanford parser.
Step2. The second step is to extract adjectives and features from reviews with their
grammatical relationships, i.e. to take grammar analysis of reviews. In this step, all adjectives
and the features modified or described by them are found. This step is conducted with
grammatical knowledge.
Step3. This step is to predict the polarity of adjectives. Based on WordNet, all adjectives are
divided into five groups. The polarity knowledge library of adjectives is output in this step.
Step4. The similarity between features and topic is computed.
Step5. Based on polarity library and similarity between features and topic, the value of
original score of every noun is produced.
Step6. This step is to produce the value of weighted score.
Step7. In order to predict the semantic orientation of reviews, it must be represented with a
vector. This step is to construct the vector of review with weighted score.
Step8. The last step is to produce polarity labels for reviews. We use two methods machine
learning technology and total weighted score computing to generate polarity labels of reviews
and compare their precision with simple term counting method.
10
Chapter 4.System Analyses And Design

4.1 ACTIVITY DIAGRAM
11
Admin Login
Input Keyword
Tweets Retreval
Data Preprocessiong
Classification using
NLP
Feature Set
Extraction
Extraction of +ve and -ve

keywords
QT clustering
Results & Graphical

Respresentation
4.2 USECASE
DIAGRAM
12
Retrieve Tweets
Admin
Feature Extraction of words
QT clustering
Result Generation
4.3 SEQUENCE DIAGRAM
13
4.4 COLLABORATION DIAGRAM
14
4.5 CLASS DIAGRAM
15
Chapter 5. Proposed System

5.1 Overview
The proposed system takes care of the informal reviews. The idea of formal and informal
words has been revised little in the proposed system. The use of English language can be
grammatically incorrect with chances of spelling mistakes and wide use of shortcuts.
Technically we can phrase these by calling them formal opinions and informal opinions.
Where formal opinions refer to the use of proper English with no grammatical mistakes and
informal refers to the improper use of English with grammatical mistakes, involving use of
words which are not standard spellings e.g. 4get instead of forget. They say that the UK
English is said to be formal English and US English is considered to be informal English.
16
Most of the customers use US English, hence it is very important to concentrate on these
opinions too, or else they might be discarded as noise. Hence the system proposed deals with
this aspect of informal reviews.
The purpose of the analysis is to extract, organize, and classify the information contained in
the required documents. The required document undergoes cleaning for unwanted stop-words
and words not listed in the dictionary. It is then classified as formal or informal depending on
the use of words, which then undergoes the application of POS tags. Thus the feature and its
accompanying opinion is identified and extracted. The output of the proposed system is
feature-opinion pairs from the review documents.
The complete architecture of the proposed opinion mining system, which consists of five
different functional components
review documents crawler,
document pre-processor,
document parser,
feature and opinion learner
17
Architecture of the proposed opinion mining system
5.1.1 Review Documents Crawler and Document Pre-processor

For a target review site, the crawler retrieves review documents and stores them locally after
filtering markup language tags. The filtered review documents are divided into manageable
record-size chunks whose boundaries are decided heuristically based on the presence of
special characters. It has been found that granularity of words, word stems, and word
synonyms may cause problem while extracting real features and opinion. We have applied
rigorous reprocessing on review documents to filter out noisy reviews that are introduced
either without any purpose or to increase/decrease the popularity of the product.
18
5.1.2 Document Parser

The functionality of this module is to facilitate the linguistic and semantic analysis of text for
information component extraction. This module accepts record-size chunks generated by
document pre-processor as input to assign Parts-Of-Speech (POS) tags to each word. It also
converts each sentence into a set of dependency relations between the pair of words. For POS
analysis and dependency relation generation purpose, we have used Stanford parser1, which is
a statistical parser. As observed in, noun phrases generally correspond to product features,
adjectives refer to opinions and adverbs are generally used as modifiers to represent the
degree of expressiveness of opinions, we have applied POS-based filtering mechanism to
avoid unwanted texts from further processing.
5.1.3 Feature and Opinion Learner

This module is responsible to analyze dependency relations generated by document parser and
generate all possible information components from them. The dependency relations between a
pair of words w1 and w2 is represented as relation type (w1; w2), in which w1 is called head
or governor and w2 is called dependent or modifier. The relationship relation type between w1
and w2 can be of two types- i) direct and ii) indirect. In a direct relationship, one word
depends on the other or both of them depend on a third word directly, whereas in an indirect
relationship one word depends on the other through other words or both of them depend on a
third word indirectly. An information component is defined as a triplet < f; m; o >, where f
represents a feature generally expressed as a noun phrase, o refers to opinion which is
generally expressed as adjective, and m is an adverb that acts as a modifier to represent the
19
degree of expressiveness of the opinion. As pointed out in, opinion words and features are
generally associated with each other and consequently, there exist inherent as well as semantic
relations between them. Therefore, the feature and opinion learner module is implemented as
a rule-based system, which analyzes the dependency relations to identify information
components from review documents. For example, consider the following opinion sentences
related to Nokia N95:
(i)
The screen is very attractive and bright.
(ii)
The sound sometimes comes out very clear.
(iii)
Nokia N95 has a pretty screen.
(iv)
Yes, the push email is the \Best" in the business.
In example (i), the screen is a noun phrase which represents a feature of Nokia N95, and the
adjective word attractive can be extracted using nominal subject nsubj relation (a dependency
relationship type used by Stanford parser) as an opinion. Further, using advmod relation the
adverb very can be identified as a modifier to represent the degree of expressiveness of the
opinion word attractive. In example (ii), the noun sound is a nominal subject of the verb
comes, and the adjective word clear is adjectival complement of it. Therefore, clear can be
extracted as opinion word for the feature sound. In example (iii), the adjective pretty is parsed
as directly depending on the noun screen through amod relationship. If pretty is identified as
an opinion word, then the word screen can be extracted as a feature; likewise, if screen is
identified as a feature, the adjective word pretty can be extracted as an opinion. Similarly in
example (iv), the noun email is a nominal subject of the verb is, and the word Best is direct
object of it. Therefore, Best can be identified as opinion word for the feature word email.
Based on these and other observations, we have defined different rules to tackle different
types of sentence structures to identify information components embedded within them.
Rule-1: In a dependency relation R, if there exist relationships nn(w1;w2) and nsubj(w3;w1)

such that POS(w1) = POS(w2) = NN_, POS(w3) = JJ* and w1, w2 are not stop-words, or if
there exists a relationship nsubj(w3;w4) such that POS(w3) = JJ*, POS(w4) =NN* and w3,
w4 are not stop-words, then either (w1;w2) or w4 is considered as a feature and w3 as an
opinion.

20
such that POS(w1) = POS(w2) = NN_, POS(w3) = JJ* and w1, w2 are not stop-words, or if
there exists a relationship nsubj(w3;w4) such that POS(w3) = JJ*, POS(w4) =NN_ and w3,
w4 are not stop-words, then either (w1;w2) or w4 is considered as the feature and w3 as an
opinion. Thereafter, the relationship advmod (w3; w5) relating w3 with some adverbial word
w5 is searched. In case of presence of advmod relationship, the information component is
identified as < (w1; w2) or w4; w5; w3 > otherwise < (w1; w2) or w4; -; w3 >.

such that POS(w1) = POS(w2) = NN_, POS(w3) = V B_ and w1, w2 are not a stop-words, or
if, there exist a relationship nsubj(w3;w4) such that POS(w3) = V B*, POS(w4) = NN* and
w4 is not a stop-word, then we search for acomp(w3;w5) relation. If acomp relationship exists
such that POS (w5) = JJ_ and w5 is not a stop-word then either (w1; w2) or w4 is assumed as
the feature and w5 as an opinion. Thereafter, the modifier is searched and information
component is generated in the same way as in Rule-2.
The need to identify and interpret possible difference in the linguistic style of textssuch as
formal or informalis increasing, as more people use the Internet as their main research
resource. There are different factors that affect the style, including the words and expressions
used and syntactical features. Vocabulary choice is likely the biggest style marker. In general,
longer words and Latin origin verbs are formal, while phrasal verbs and idioms are informal
(Park, 2007). There are also many formal/informal style equivalents that can be used in
writing.
The formal style is used in most writing and business situations, and when speaking to people
with whom we do not have close relationships. Some characteristics of this style are long
words and using the passive voice. Informal style is mainly for casual conversation, like at
home between family members, and is used in writing only when there is a personal or closed
relationship, such as that of friends and family. Some characteristics of this style are word
contractions such as wont, abbreviations like phone, and short words. We discuss the
main characteristics of both styles
5.2 Characteristics of Informal Style Text

The informal style has the following characteristics:
1. It uses a personal style: the rst and second person (I and you) and the active
voice (e.g., I have noticed that...).
2. It uses short simple words and sentences (e.g., latest).
21
3.
4.
5.
6.
It uses contractions (e.g., wont).

It uses many abbreviations (e.g., TV).
It uses many phrasal verbs in the text (e.g., nd out).
Words that express rapport and familiarity are often used in speech, such as brother,
buddy and man.
7. It uses a subjective style, expressing opinions and feelings (e.g.pretty, I feel).
8. It uses vague expressions, personal vocabulary and colloquialisms (slang words are
accepted in spoken text, but not in written text (e.g., wanna = want to))
5.3 Characteristics of Formal Style Text

The formal style has the following characteristics:
1. It uses an impersonal style: the third person (it, he and she) and often the passive
voice (e.g., It has been noticed that...).
2. It uses complex words and sentences to express complex points (e.g., state-of-the-art).
3. It does not use contractions.
4. It does not use many abbreviations, though there are some abbreviations used in formal
texts, such as titles with proper names (e.g., Mr.) or short names of methods in scientific
papers (e.g., SVM).
5. It uses appropriate and clear expressions, precise education, and business and technical
vocabularies (Latin origin).
6. It uses polite words and formulae, such as Please, Thank you, Madam and Sir.
7. It uses an objective style, citing facts and references to support an argument.
8. It does not use vague expressions and slang words.
Feature
Informal Style
Formal Style
Contractions
Use Contractions:
Avoid Contractions:
e.g., Many patients dont listen to

their doctors.
e.g., Many patients do not

listen to their doctors.
22
Phrasal
Use phrasal verbs:
Avoid phrasal verbs:
Verbs
e.g.; I looked up information about

nursing positions.
e.g., I researched
information about nursing
positions.
Personal/
Use personal pronouns:
Use impersonal pronouns:
Impersonal
e.g., I think this is an effective plan.
e.g., This could be an

effective plan.
Pronouns
List of Informal and Formal Words
Informal
Formal
about
approximately
and
in addition
anybody
anyone
ask for
request
boss
employer
but
however
buy
purchase
end
finish
enough
sufficient
get
obtain
go up
increase
have to
must
23
Contractions List
Informal
Formal
arent
are not
cant
cannot
didnt
did not
hadnt
had not
hasnt
has not
Im
I am
Abbreviations List
Informal
Formal
asap
as soon as possible
grad.
graduate
HR
Human Resources
Feb.
February
Lab
laboratory
temp
Temperature
24
Short Code list

Msg
Luv
Wlcm
Plzz
Whn
Wld
Btw
Bcoz
Message
Love
Welcome
Please
When
Would
By the way
Because
Chapter 6. Implementation
6.1 Hardware Resources Required:
Hardware Requirements
1
RAM
1GB or more
Processor
Intel dual core or later versions
OS
Microsoft Windows XP or Microsoft Windows 7
6.2 Software Resources Required:

Software Requirements
Software for
1
development
2
Platform
Netbeans, NLP, JDK,

Windows
25
Dataset
Review Website
6.3 Selection of a platform

A computing platform includes a hardware architecture and a software framework (including
application frameworks), where the combination allows software to run. Typical platforms
include a computer architecture, operating system and runtime libraries. A platform is a
crucial element in software development. A platform might be simply defined as a place to
launch software. The platform provider offers the software developer an undertaking that
logic code will run consistently as long as the platform is in place.
The platform for this project used is Windows. The reason for selecting Windows as
the operating platform is because of its user friendly characteristic and its other versions.
Windows is popularly known to all and widely used all over the world. Microsoft has made
several advancements and changes that have made it a much easier to use operating system,
and although arguably it may not be the easiest operating system, it is still Easier than Linux.
Windows comes with built-in Internet Connection Firewall software that provides you with a
resilient defence to security threats when you're connected to the Internetparticularly if you
use always-on connections such as cable modems and DSL. Windows offers thousands of
security-related settings that can be implemented individually.
All Windows versions have been based on a file system permission system referred to
as AGLP (Accounts, Global, Local, Permissions). So creating files and their dynamic use can
become simpler in case of Windows Platform as compared to others.
The other main reason for selecting Windows as the computing platform is the availability of
open-source free licensed softwares that gives the developer good opportunity to explore and
play with many things. Because of the large amount of Microsoft Windows users, there is a
much larger selection of available software programs, utilities for Windows.
6.4 Software used

6.4.1 Netbeans
NetBeans is an integrated development environment (IDE) for developing primarily
with Java, but also with other languages, in particular PHP, C/C++, and HTML5. It is also an
application framework for Java desktop applications and others.
The NetBeans IDE is written in Java and can run on Windows, OS X, Linux, Solaris and other
platforms supporting a compatible JVM.
The NetBeans Platform allows applications to be developed from a set of modular software
components called modules. Applications based on the NetBeans Platform (including the
NetBeans IDE itself) can be extended by third party developers
NetBeans IDE provides first-class comprehensive support for the newest Java technologies
and latest Java specification enhancements before other IDEs. It is the first free IDE providing
support for JDK 8 previews, JDK 7, Java EE 7 including its related HTML5 enhancements,
26
and JavaFX
With its constantly improving Java Editor, many rich features and an extensive range of tools,
templates and samples, NetBeans IDE sets the standard for developing with cutting edge
technologies out of the box.
Netbeans Code Editor
6.4.2 NLP Stanford Parser

A natural language parser is a program that works out the grammatical structure of
sentences, for instance, which groups of words go together (as "phrases") and which words
are the subject or object of a verb. Probabilistic parsers use knowledge of language gained
from hand-parsed sentences to try to produce the most likely analysis of new sentences. These
statistical parsers still make some mistakes, but commonly work rather well. Their
development was one of the biggest breakthroughs in natural language processing.
The parser runs under Windows and Unix/Linux/MacOSX and requires a Java Runtime
Environment (JRE) (Java 1.5 or higher).
Parsers for different languages such as Chinese, Arabic, English and German are provided. In
most cases, the probabilistic context-free grammar (PCFG) parser will be sufficient, since it
processes fast, shows good accuracy values and moderate memory usage. A PCDG model is
not provided for parsing German texts. The parser model called FACTORED is more complex
and requires more memory because it contains two grammars and leads the system to run
27
three parsers.
POS Tag
CC
CD
DT
EX
FW
IN
JJ
JJR
JJS
LS
MD
NN
NNS
NNP
NNPS
PDT
POS
PRP
PRP$
RB
RBR
RBS
RP
TO
UH
VB
VBD
VBG
VBN
VBP
VBZ
Description
coordinating conjunction
cardinal number
determiner
existential there
foreign word
preposition/subordinating
conjunction
adjective
adjective, comparative
adjective, superlative
list marker
modal
noun, singular or mass
noun plural
proper noun, singular
proper noun, plural
predeterminer
possessive ending
personal pronoun
possessive pronoun
Example
and
1, third
the
there is
dhoevre
in, of, like
big
bigger
biggest
1)
could, will
door
doors
John
Vikings
both the boys
friends
I, he, it
my, his
however, usually, naturally,
adverb
here, good
adverb, comparative
better
adverb, superlative
best
particle
give up
to
to go, to him
interjection
uhhuhhuhh
verb, base form
take
verb, past tense
took
verb, gerund/present participle Taking
verb, past participle
taken
verb, sing. present, non-3d
take
verb, 3rd person sing. present takes
28
WDT
WP
WP$
WRB
wh-determiner
wh-pronoun
possessive wh-pronoun
wh-abverb
which
who, what
whose
where, when
Capter 6. CONCLUSION
Emotion and mood often affect user's behavior and his/her interaction with other users.
Positivity and negativity are two important attributes of emotion and mood. Based on our
analysis, we explored several key differences across positive and negative users of online
social networks. We introduced a methodology for tweet partitioning and user partitioning,
which led to seven groups of users from highly negative to highly posItIve. Our findings show
that negative users are not interested in sharing their negativity in social media when
compared neutral and positive users. In addition, positive users are more likely to make
friendships with negative users based on the number of followers and followees. An
interesting result is that negative users retweet much more than positive users. They use
Twitter as a tool for social awareness and they need to gain emotional support. Retweeting
positive tweets makes them more positive. Also, negative users try to avoid any interaction by
replying to tweets, since they believe it can make them more negative. In addition, positive
users also don't interact much with negative users, which means everyone triesto avoid
29
interacting with negative users, perhaps because such interactions make them more negative.
Our findings can beused in developing tools for automatically identifying positive and
negative users based on user activities in Twitter. The key distinguishing criteria are how
many followers and followees a user has, the volume and type (positive or negative) of tweets
a user send out, the volume and type of tweets a user retweets as well as replies to. Our future
work involves building a classifier based on the findings of this paper.
REFERENCES
[1] A. Gluhakl, M. Presser, L. Zhu, S. Esfandiyari, S. Kupschick, "TowardsMood Based
Mobile Services and Applications," EuroSSC. Kendal, vol.4793, pp. 159-174, October 2007.
[2] C.M. Lee, and S.S. Narayanan, "Toward detecting emotions in spokendialogs," IEEE
Trans. Speech and Audio Processing, vol. 13, no. 2, pp.293-303, March 2005.
[3] S.P. Dibris, A. Stagliano, A. Camurri, A. F.O. Dibris, "A set offull-bodymovement features
for emotion recognition to help children affected byautism spectrum condition," IDGEI
International Workshop.Chania, May2013, in press.
[4] L. Rui, S. Wang, H. Deng, R. Wang, K. C. Chang, "Towards social userprofiling: unified
and discriminative influence model for inferring homelocations,". In LinkKDD. Beijing, pp.
1023-1031, August 2012.
[5] R. Picard, Affective Computing, M.l.T Media Laboratory PerceptualComputing Section
Technical Report, vol. 321,1995, pp. 1-26.
30

Analysis & Review of Product Using Twitter Tweets As A Dataset

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Analysis & Review of Product Using Twitter Tweets As A Dataset

Uploaded by

Copyright:

Available Formats

A

Analysis & Review of Product using Twitter

Prof. Chandrashekhar Badgujar

Department of Computer Engineering

Mining is a bonafide work of

3. Ms. Rao Vidya

Submitted to the University of Mumbai in partial fulfillment of

(Prof. Chandrasekhar Badgujar)

(Prof. Jagrati Shekhawat)

(Dr. Anitha Patil)

(Dr. Santosh K. Narayankhedkar)

Project Report Approval for B. E.

BE Students is approved for the degree of Computer Engineering

Gadade Dnyaneshwar (111CP1193)

(Khavase Namrata (111CP1727)

Rao Vidya (111CP1726)

How does Opinion Mining System Works?

1.1 Scope of the Project

Chapter 2. MOTIVATION & APPROACH

Chapter 3. Related Work

3.2 Existing System

3.3 Level of Opinion Mining

The document level,

The sentence level.

The feature level.

level, identification of subjective or opinionated sentences amongst the corpus is done by

Task 3: Grouping feature synonyms and producing a feature-based opinion summary

3.3.1 Document Level Sentiment Analysis

Example (positive Document)

1) Classify documents (e.g., reviews) based on the overall sentiments expressed by

Positive, negative, and (possibly) neutral

2) Similar but different from topic-based text classification.

In topic-based text classification, topic words are important.

3) Mainly at the document-level, but also extendable to the sentence-level.

It does not give details on what people liked or disliked

It is not easily applicable to non-reviews, e.g. forum and blog postings

3.3.2 Sentence-level sentiment Analysis

3.3.3 Feature-based sentiment analysis

3.4 System structure

Chapter 4.System Analyses And Design

Extraction of +ve and -ve

Results & Graphical

4.3 SEQUENCE DIAGRAM

4.4 COLLABORATION DIAGRAM

4.5 CLASS DIAGRAM

Chapter 5. Proposed System

review documents crawler,

feature and opinion learner

Architecture of the proposed opinion mining system

5.1.1 Review Documents Crawler and Document Pre-processor

5.1.2 Document Parser

5.1.3 Feature and Opinion Learner

The screen is very attractive and bright.

The sound sometimes comes out very clear.

Nokia N95 has a pretty screen.

Yes, the push email is the \Best" in the business.

Rule-1: In a dependency relation R, if there exist relationships nn(w1;w2) and nsubj(w3;w1)

Rule-2: In a dependency relation R, if there exist relationships nn(w1;w2) and nsubj(w3;w1)

Rule-3: In a dependency relation R, if there exist relationships nn(w1;w2) and nsubj(w3;w1)

5.2 Characteristics of Informal Style Text

It uses contractions (e.g., wont).

5.3 Characteristics of Formal Style Text

e.g., Many patients dont listen to