Literature Survey on SemEval-2015 Systems for
Twitter Polarity Classification

Mateusz Garbacz

Delft University of Technology


m.garbacz@student.tudelft.nl

Abstract. The paper reviews the most successful systems of SemEval-2015¹,
Task 10, Subtasks A and B. These systems classify the polarity of tweets
at the phrase level and the message level. The focus is on the methodology
applied in relation to the results obtained. It is shown that non-linear
classifiers achieve superior results for the problem and that lexicon-based
features significantly increase the capabilities of the systems. Finally,
the paper builds a knowledge base for participation in future editions of
the SemEval workshops.

Keywords: Twitter, Polarity Classification, SemEval-2015

1 Introduction
Online social networking platforms like Twitter or Facebook, where users interact
with each other through, for example, messages expressing opinions on a given
topic, have grown significantly over the years. Moreover, the amount of data
generated by these platforms increases every year. Therefore, the need for
accurate data analysis techniques is more significant than ever before. [8]
To address this need, the Semantic Evaluation 2015 Workshop (SemEval-2015)
provides a platform promoting research in multiple fields related to Information
Retrieval. [4] One of them, namely Polarity Classification on Twitter data, is
covered in Task 10, in Subtask A at the phrase level and in Subtask B at the
message level. In this task, a system needs to classify a given message or a part
of it as positive, negative or neutral. To this end, the researchers are given
annotated datasets, on which they build systems to solve the Subtasks and compete
to achieve the highest scores on the official test set. [4]
The goal of this paper is to review the most successful works in SemEval-2015
Task 10 that participated in both mentioned Subtasks. The main motivation is to
study state-of-the-art methodology for approaching the problem of Polarity
Classification in tweets and their parts. Therefore, aspects like pre-processing,
feature extraction, classification model training, and evaluation of the entire
systems are analyzed. Finally, the review forms a knowledge base for taking part
in future editions of the workshop as well as for developing an effective system
for Polarity Classification on social media data.

2 Paper Selection
This section presents an overview of the literature chosen for the review. There
are three criteria for the paper selection:
1. the system has to be applied in both Subtasks A and B to enable fair analysis
   of all the selected papers
2. the system needs to have a high ranking [6] in both Subtasks A and B to
   evaluate only the most effective solutions
3. the system needs to be well described in terms of implementation to enable
   successful analysis

¹ Semantic Evaluation 2015 Workshop

Due to these constraints, several papers have been excluded, for instance the
system of N. Plotnikova et al. [5], because its description lacks detail on the
extracted features, and the system developed by M. Hagen et al. [2], which is not
applied in Subtask A.
The selected systems that match the criteria are listed below:

– UNITN: A. Severyn and A. Moschitti [7] present a deep learning system trained
  on word embeddings, which achieved 1st and 2nd place in Subtasks A and B
  respectively
– IOA: P. Li et al. [4] use a Support Vector Machine (SVM) classifier with an
  RBF kernel on a large variety of features, which achieved 3rd and 7th place
  in Subtasks A and B respectively
– TwitterHawk: W. Boag et al. [1] focus on extensive pre-processing and then
  apply linear classifiers to the extracted features. As a result, the system
  reaches 5th and 10th position in Subtasks A and B respectively
– IITPSemEval: A. Kumar et al. [3] apply linear SVM models to various types
  of features; the system is ranked 6th and 22nd in Subtasks A and B
  respectively

3 Pre-processing and Feature Extraction


This section describes the pre-processing and feature extraction techniques
applied in each paper, in order to summarize the proven methodology in the field.
The pre-processing methods implemented in each system are shown in Table 1.
Clearly, tokenization and normalization of URLs and author names are standard
methods for this kind of problem. The TwitterHawk system, however, implements
much more extensive pre-processing. It includes hashtag segmentation, for
instance “#WeLoveBowling” into “We Love Bowling”, and spelling correction, for
example “hahahaha” is converted into “haha” and “heyyyy” into “hey”, which
enables extracting high-quality features. [1]

Pre-processing Technique UNITN IOA TwitterHawk IITPSemEval


Tokenization X X X X
Normalization of URL and author X X X X
Lowercasing X X
Spelling Correction X
Hashtag Segmentation X
Table 1. Pre-processing techniques applied by reviewed systems
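
The pre-processing steps listed in Table 1 can be illustrated with a minimal
Python sketch. It is not taken from any of the reviewed systems: the placeholder
tokens <url> and <user>, the reading of "author" as an @-mention, the camel-case
rule for hashtag segmentation and the regular expressions for elongated words
are assumptions made purely for illustration.

```python
import re

# All patterns below are illustrative assumptions, not the reviewed systems' rules.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
# Split a hashtag body on camel-case boundaries and digit runs.
CAMEL_RE = re.compile(r"[A-Z][a-z]+|[a-z]+|[A-Z]+(?![a-z])|\d+")
# A two-character unit repeated three or more times, e.g. "hahahaha".
REPEAT_PAIR_RE = re.compile(r"(\w\w)\1{2,}")
# A single character repeated three or more times, e.g. "heyyyy".
REPEAT_CHAR_RE = re.compile(r"(\w)\1{2,}")


def segment_hashtag(match):
    """Turn '#WeLoveBowling' into 'We Love Bowling' (camel-case heuristic only)."""
    return " ".join(CAMEL_RE.findall(match.group(1)))


def preprocess(tweet):
    """Normalize URLs and mentions, segment hashtags, shorten elongations, tokenize."""
    text = URL_RE.sub("<url>", tweet)
    text = USER_RE.sub("<user>", text)
    text = HASHTAG_RE.sub(segment_hashtag, text)
    text = REPEAT_PAIR_RE.sub(r"\1\1", text)   # "hahahaha" -> "haha"
    text = REPEAT_CHAR_RE.sub(r"\1", text)     # "heyyyy"   -> "hey"
    return text.split()                        # whitespace tokenization


print(preprocess("@bob heyyyy hahahaha #WeLoveBowling http://t.co/abc"))
# ['<user>', 'hey', 'haha', 'We', 'Love', 'Bowling', '<url>']
```

A camel-case rule only handles hashtags that mark word boundaries with capitals;
an all-lowercase hashtag would need a dictionary-based segmenter instead.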

Furthermore, Table 2 shows the feature extraction techniques applied in the
reviewed systems. The IOA, TwitterHawk and IITPSemEval systems extract similar
types of features. The researchers exploit basic text features, e.g. the number
of commas in the text, word and character n-grams, as well as inverted sentiment,
which for every negation inverts the sentiment of a few following tokens.
Moreover, they make use of various tools like CMU Tweet NLP² for Part-Of-Speech
tagging and word clustering, and of up to seven different sentiment lexicons [4]
to assign sentiment scores to the tokens. Interestingly, the systems that extract
word embeddings using e.g. word2vec³ achieve the highest ranks of the reviewed
papers in both Subtasks A and B.

Feature Extraction Technique    UNITN  IOA  TwitterHawk  IITPSemEval
Basic Text Features               -     X        X            X
Word or character n-grams         -     X        X            X
POS tagging                       -     X        X            X
Word clustering                   -     X        X            X
Inverted Sentiment (negation)     -     X        X            X
Lexicon-based features            -     X        X            X
Word embeddings                   X     X        -            -
Table 2. Feature extraction techniques applied in reviewed papers
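
Two of the feature types from Table 2, character n-grams and inverted sentiment,
can be sketched in a few lines of Python. The lexicon, the negation list and the
window of three tokens are hypothetical values chosen for illustration; the
reviewed papers rely on real sentiment lexicons and their own negation rules.

```python
# Hypothetical toy resources; real systems use full sentiment lexicons.
NEGATIONS = {"not", "no", "never", "cannot"}
TOY_LEXICON = {"good": 1.0, "love": 1.0, "bad": -1.0, "hate": -1.0}


def char_ngrams(token, n=3):
    """Return all contiguous character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def inverted_sentiment(tokens, window=3):
    """Score each token with the lexicon, flipping the sign of the
    `window` tokens that follow a negation word (assumed window size)."""
    scores = []
    remaining = 0  # how many upcoming tokens are still in a negation scope
    for token in tokens:
        word = token.lower()
        if word in NEGATIONS:
            remaining = window
            scores.append(0.0)
            continue
        score = TOY_LEXICON.get(word, 0.0)
        if remaining > 0:
            remaining -= 1
            if score != 0.0:
                score = -score
        scores.append(score)
    return scores


print(char_ngrams("good"))                             # ['goo', 'ood']
print(inverted_sentiment("this is not good".split()))  # [0.0, 0.0, 0.0, -1.0]
```

The per-token scores would typically be aggregated, for example by summing them
or counting positive and negative tokens, before being passed to a classifier.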
Surprisingly, the UNITN system, which performs minimal pre-processing and
extracts only a single type of feature, achieves the 1st and 2nd rank in the
mentioned Subtasks. This might be a result of applying a sophisticated classifier
and training strategy, described in Section 4. Moreover, it is not specified how
many features of each type the systems extract, which is an important factor to
take into consideration during the analysis. Nonetheless, the IOA system makes
use of all types of features described in this paper and gets very high scores as
well. Although extensive pre-processing and feature extraction do not guarantee
top results, they make it possible to create a high-quality solution.

4 Classifier Training
This section focuses on the training process of the classifiers. In contrast to
the pre-processing and feature extraction covered in the previous section, the
classifier training methodology differs significantly between the papers, and
the systems are therefore reviewed separately.
The UNITN system is based on a Convolutional Neural Network. Since the amount of
training data for the deep learning process is very low, the researchers propose
a 3-step initialization process. Firstly, they collect 50M tweets and use them to
learn the word embeddings with an unsupervised word2vec model. This step
facilitates convergence of the network in the further steps. Secondly, they use a
distant supervision approach, in which they collect another 10M tweets and assign
them to the positive, neutral or negative class according to the emoticons
present. Then, they feed the network with this data to refine the embeddings and
initialize the sentiment capturing. Thirdly, the researchers use the parameters
of the network as a starting point for supervised training on the SemEval
dataset. [7]
The 3-phase approach combines unsupervised, semi-supervised and supervised
learning. The researchers thereby show how to incorporate unannotated data into
the problem, and the approach may be improved further by streaming more tweets or
selecting a more sophisticated rule in the second step. Consequently, the
researchers train an extremely complex model using a significant amount of
streamed and provided data, which would not be as effective using a supervised
approach only.
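
The distant supervision step can be illustrated with a small Python sketch that
labels a tweet from the emoticons it contains. The emoticon sets below are
hypothetical, and since the rule used to derive the neutral class is not
described in this review, the sketch simply leaves tweets without a clear
emoticon signal unlabeled.

```python
# Hypothetical emoticon sets; the actual rule of the UNITN system may differ.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", "=)"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}


def distant_label(tweet):
    """Return a noisy label derived from emoticons, or None if unclear."""
    tokens = tweet.split()
    has_positive = any(t in POSITIVE_EMOTICONS for t in tokens)
    has_negative = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    return None  # no emoticon or conflicting emoticons: leave unlabeled


tweets = ["great game :)", "missed the bus :(", "meeting at 5"]
print([(t, distant_label(t)) for t in tweets])
```

Labels obtained this way are noisy, which is why they serve only to initialize
the network before the supervised training on the annotated data.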
The IOA system implements two different approaches using an SVM classifier with
an RBF kernel. The approach for Subtask A is a semi-supervised iterative
algorithm, which trains the classifier on the training set and calculates
posterior probabilities for the non-annotated samples. The instances with a high
enough probability of belonging to one of the classes are assigned to that class
and added to the training set for the next iteration. The approach for Subtask B
simply weights the posterior probabilities given by a trained supervised
classifier, and the class with the maximum weighted posterior probability is
selected for a given sample. [4]

² http://www.cs.cmu.edu/~ark/TweetNLP/
³ https://deeplearning4j.org/word2vec

The system uses a complex, non-linear classifier on data containing many
features. Consequently, it captures hidden patterns well. However, using such a
model increases the chance of overfitting, which cannot be assessed based on the
provided information. Moreover, the performance is further improved by the
iterative algorithm and the posterior probability weighting.
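
Both IOA strategies can be outlined in Python under stated assumptions. The
sketch works with any classifier object that offers fit, predict_proba and a
classes_ attribute, for instance scikit-learn's SVC(kernel="rbf",
probability=True). The confidence threshold, the iteration limit and the class
weights are hypothetical parameters, not values reported by the authors.

```python
def self_train(make_classifier, X_lab, y_lab, X_unlab, threshold=0.9, max_iter=10):
    """Iteratively move confidently classified unlabeled samples into the
    training set and retrain (threshold and max_iter are assumed values)."""
    X_train, y_train = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    clf = make_classifier()
    clf.fit(X_train, y_train)
    for _ in range(max_iter):
        if not pool:
            break
        still_unlabeled, added = [], 0
        for x, probs in zip(pool, clf.predict_proba(pool)):
            best = max(range(len(probs)), key=lambda k: probs[k])
            if probs[best] >= threshold:
                X_train.append(x)
                y_train.append(clf.classes_[best])
                added += 1
            else:
                still_unlabeled.append(x)
        if added == 0:
            break  # nothing confident enough: stop early
        pool = still_unlabeled
        clf = make_classifier()
        clf.fit(X_train, y_train)
    return clf, X_train, y_train


def weighted_prediction(probs, weights, classes):
    """Pick the class with the largest weighted posterior probability."""
    weighted = [p * w for p, w in zip(probs, weights)]
    return classes[max(range(len(weighted)), key=weighted.__getitem__)]


# Hypothetical weights that favour the neutral class.
print(weighted_prediction([0.5, 0.3, 0.2], [1.0, 2.0, 1.0],
                          ["negative", "neutral", "positive"]))  # neutral
```

Passing a factory such as make_classifier, rather than a fitted model, keeps
each iteration independent, so the classifier is retrained from scratch on the
enlarged training set.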
In contrast, TwitterHawk trains a linear SVM and an SGDClassifier for Subtasks A
and B respectively. Further, the researchers perform ablation studies, in which
they train the classifier using different sets of features and test which model
works best. [1]
Similarly, IITPSemEval uses a linear SVM for training in both Subtasks. However,
before training, the researchers perform oversampling to balance the class
distribution, and after the training phase they perform ablation studies to pick
the best set of features. [3]
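
Oversampling and ablation can be sketched as follows. Random duplication of
minority-class samples and a leave-one-group-out ablation are assumed variants,
as the exact procedures are not detailed in this review; the evaluate callback
is a hypothetical function that trains a model on the given feature groups and
returns its score.

```python
import random
from collections import defaultdict


def oversample(samples, labels, seed=0):
    """Duplicate random samples of the smaller classes until every class
    is as large as the largest one (assumed oversampling variant)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def ablation(feature_groups, evaluate):
    """Return, per feature group, the score lost when that group is removed."""
    all_groups = set(feature_groups)
    full_score = evaluate(all_groups)
    return {g: full_score - evaluate(all_groups - {g}) for g in feature_groups}


samples, labels = oversample(["a", "b", "c", "d"], ["pos", "pos", "pos", "neg"])
print(labels.count("pos"), labels.count("neg"))  # 3 3
```

A large drop for a feature group in the ablation result indicates that the group
contributes strongly to the final score.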
The last two approaches use much simpler models, linear classifiers, which are
not able to capture hidden patterns in the data as well as the non-linear ones.
Clearly, the researchers focused much more on extracting informative features
from the data than on applying different classification variants.
To conclude, the most successful approaches utilize complex classifiers and
use additional non-annotated data for better training. Nevertheless, the systems
which apply linear classifiers perform satisfactorily, although they are ranked
lower overall.

5 Evaluation
This section sums up the evaluation and conclusions made by the researchers for
each paper, and critically assesses the reviewed literature. The standard
evaluation metric for the workshop is the average of the F1 scores of the
positive and negative classes. Even though the measure disregards the neutral F1
score, it gives an indication of how well the systems classify the positive and
negative classes.
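
The metric can be written down as a short Python sketch. The label strings and
the convention of returning 0.0 when a class has no true positives are
assumptions of this example.

```python
def f1_for_class(gold, pred, label):
    """F1 score of a single class computed from gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # assumed convention when precision or recall is undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_pos_neg_f1(gold, pred):
    """Average of the positive and negative F1 scores; neutral is ignored
    as a target class but still counts as an error source."""
    return (f1_for_class(gold, pred, "positive")
            + f1_for_class(gold, pred, "negative")) / 2


gold = ["positive", "positive", "negative", "neutral"]
pred = ["positive", "neutral", "negative", "negative"]
print(round(average_pos_neg_f1(gold, pred), 4))  # 0.6667
```

In this example the neutral tweet predicted as negative lowers the precision of
the negative class, so neutral examples still influence the score indirectly.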
The UNITN system scored an F1 of 84.79% and 64.59% on the main test set for
Subtasks A and B respectively. Moreover, the researchers show that it is highly
robust by reporting strong results on other test sets as well. [7]
The IOA system achieved F1 scores of 82.76% and 62.63% in the Subtasks.
Experimenting with different sets of features and method parameters improved the
results on most of the test sets, but degraded others. [4]
TwitterHawk scored 82.32% and 61.99% in the Subtasks. The researchers
highlight the most influential features, for instance lexicon-based features,
and the methods that require improvement, e.g. hashtag segmentation. [1]
Finally, IITPSemEval scored 81.31% and 58.8% F1 in the Subtasks. It is shown
that lexicon-based features are the most influential, while inverted sentiment
did not lead to the expected improvements. [3]
To sum up, all systems get better results on the phrase-level task than on the
message-level task. This is mainly caused by messages containing words of
different sentiment and being dependent on various aspects, for instance
sarcasm. Moreover, using lexicon-based features is beneficial for the final
model, and utilizing word embeddings alone is sufficient to train a top-ranked
system. However, it is not mentioned how many features of each type each system
exploits, which should be considered during the analysis. Although the UNITN
system achieves the highest F1 scores, the other reviewed works obtain comparable
results, differing by at most about 6 percentage points in Subtask B (64.59%
versus 58.8%). This is most probably caused by the limited amount of annotated
training data that can be used for training. Nevertheless, the UNITN system
exploited an additional 60M tweets during training, which is a major factor in
achieving the top scores. Therefore, it is an open question whether the other
systems would get higher results if they utilized large amounts of streamed
unannotated data. Finally, the papers focus mainly on the implementation instead
of system evaluation and justification of the results. As a result, none of them
takes overfitting into account, which is a major issue for many systems in the
Machine Learning domain and may appear when training classifiers on a large
number of features and a low number of samples.

6 Conclusions
The paper reviews the most successful works taking part in SemEval-2015 Task 10,
Subtasks A and B. It analyzes the pre-processing applied, the features extracted,
the classifier training methodology, and the conclusions made by the researchers.
It is shown that careful feature engineering does not guarantee top results.
Moreover, complex, non-linear classification models achieve better results than
linear ones. Finally, streamed, unannotated data may be used to improve the
results of complex classifiers, which typically require large amounts of data for
training.

7 Future Work
The paper builds a knowledge base for building an effective system for Twitter
Polarity Classification. Therefore, the next steps might be extending this
knowledge by analyzing the results of SemEval-2016 using the same approach and
taking part in future editions of the Semantic Evaluation workshops.

References
1. Boag, W., Potash, P., Rumshisky, A.: TwitterHawk: A feature bucket based
   approach to sentiment analysis. In: SemEval@NAACL-HLT (2015)
2. Hagen, M., Potthast, M., Büchner, M., Stein, B.: Webis: An ensemble for
   Twitter sentiment detection. In: SemEval@NAACL-HLT (2015)
3. Kumar, A., Krishna, V., Ekbal, A.: IITPSemEval: Sentiment discovery from 140
   characters. In: SemEval@NAACL-HLT (2015)
4. Li, P., Xu, W., Ma, C., Sun, J., Yan, Y.: IOA: Improving SVM based sentiment
   classification through post processing. In: SemEval@NAACL-HLT (2015)
5. Plotnikova, N., Kohl, M., Volkert, K., Evert, S., Lerner, A., Dykes, N., Ermer, H.:
   KLUEless: Polarity classification and association. In: SemEval@NAACL-HLT (2015)
6. Rosenthal, S., Nakov, P., Kiritchenko, S., Mohammad, S., Ritter, A., Stoyanov, V.:
   SemEval-2015 task 10: Sentiment analysis in Twitter. In: SemEval@NAACL-HLT
   (2015)
7. Severyn, A., Moschitti, A.: UNITN: Training deep convolutional neural network
   for Twitter sentiment classification. In: SemEval@NAACL-HLT (2015)
8. Skuza, M., Romanowski, A.: Sentiment analysis of Twitter data within big data
   distributed environment for stock prediction. In: 2015 Federated Conference on
   Computer Science and Information Systems (FedCSIS) (2015)
