
2016 8th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia

Real-time Traffic Classification with Twitter Data Mining

Dwi Aji Kurniawan1, Sunu Wibirama2, Noor Akhmad Setiawan3


Department of Electrical Engineering and Information Technology
Universitas Gadjah Mada
Indonesia
1dwi.aji.k@mail.ugm.ac.id, 2sunu@ugm.ac.id, 3noorwewe@ugm.ac.id

Abstract: The growth of vehicles in Yogyakarta Province, Indonesia is not proportional to the growth of roads. This problem causes severe traffic jams on many main roads. Common traffic anomaly detection using surveillance cameras requires manpower and is costly, while traffic anomaly detection with crowdsourcing mobile applications is mostly owned by private companies. This research aims to develop real-time traffic classification by harnessing the power of social network data from Twitter. In this study, Twitter data are processed through the stages of preprocessing, feature extraction, and tweet classification. This study compares the classification performance of three machine learning algorithms, namely Naive Bayes (NB), Support Vector Machine (SVM), and Decision Tree (DT). Experimental results show that the SVM algorithm produced the best performance among the algorithms, with 99.77% and 99.87% classification accuracy on balanced and imbalanced data, respectively. This research implies that social network services may be used as an alternative source for traffic anomaly detection by providing information on traffic flow conditions in real time.

Keywords: traffic, data mining in Twitter, social network, tweet classification, machine learning.

I. INTRODUCTION
The growth of vehicles in big cities is not proportional to the growth of roads. Sooner or later, roads in big cities will be increasingly jammed. Installing surveillance cameras at some streets and intersections has been a common approach to real-time traffic anomaly detection. Nevertheless, this approach requires manpower to observe the cameras and to locate the spatial position of the traffic information. On the other hand, location-based crowdsourcing technology such as Waze (https://www.waze.com) is currently used as a driver's companion for route finding. However, Waze is a proprietary service, so the authorities may find it difficult to get access to the data.

Social network services have been used to detect traffic anomalies and events. An approach developed by Sakaki et al. [1] shows that Twitter detects an event faster than traditional media. D'Andrea et al. [2] compared the performance of seven classification algorithms for classifying Italian tweets. Sakaki et al. [3] proposed four stages to detect locations in Japanese tweets. Gu et al. [4] classified tweets about traffic in the cities of Pittsburgh and Philadelphia (USA) using Semi Naive Bayes (SNB) and Supervised Latent Dirichlet Allocation (sLDA). Gutiérrez et al. [5] classified English tweets using a Support Vector Machine (SVM) classifier in RapidMiner software. These studies used the local language as their source of information. Moreover, these studies focused more on traffic events such as accidents, road work, snow, and road closures, but did not focus on the state of traffic flow, such as traffic jams, crowded, crowded but smooth, and smooth.

There were also some studies in Indonesia related to the use of social networking for monitoring traffic conditions. Research by Wibisono et al. [6] in Jakarta used a Learning Vector Quantization (LVQ) neural network to classify tweets into three classes: low traffic flow, medium traffic flow, and high traffic flow. The system developed by Wibisono et al. [6] used tweets from the official accounts of traffic officers as a data source. Another study in Bandung by Rodiyansyah and Winarko [7] classified traffic tweets into four classes (Loss, Current, Unknown, and Model) using Naive Bayes and Support Vector Machine (SVM) algorithms in RapidMiner software. However, those previous works [6, 7] did not classify Twitter data from all users (regular and official) in real time.

In this research, we propose a novel real-time traffic classification that classifies Twitter data into the traffic or non_traffic category. Classification was validated using ten-fold cross-validation to measure the accuracy, precision, recall, and F-score of the classifiers and dataset.

II. DATA ACQUISITION
A dataset of tweets about Yogyakarta Province, Indonesia was used in this research to build the classification model. We categorized tweets into traffic and non_traffic. The data were collected over seven consecutive days.

In the first stage, traffic tweets were collected from seven official traffic monitoring Twitter accounts, namely @ATCS_DIY, @atcs_kotasmrg, @atcs_kotatgr, @atcs_pekalongan, @ntmclantaspolri, @tmcpoldametro, and @tmcpolressemara. The data were collected using the Twitter REST API. Non_traffic tweets were removed from the tweets of those accounts, so that only tweets related to traffic conditions were considered. In the second stage, tweets were collected using the Twitter Streaming API from the selected users in Table I and the keywords in Table II. We labeled these tweets as traffic or non_traffic manually.

978-1-5090-4139-8/16/$31.00 ©2016 IEEE

III. PROPOSED METHOD
The flowchart of the proposed method is shown in Fig. 1.

TABLE II
KEYWORDS USED IN TRACK PARAMETER
Yogyakarta        Jogjakarta     Jogja
Yogya             Adisutjipto    Adi Sutjipto
lalinjogja        RTMC_Jogja     ATCS_DIY
jogjaupdate       jogja24jam     infojogja
yogyakartacity    jogjamedia     tribunjogja
unisifmyk         UGM            UII
UNY               UMY            lalinyk
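
As an illustration of how the keywords in Table II might be supplied to the Streaming API, the sketch below joins them into a single comma-separated value for the track parameter. To our understanding, the Streaming API of that period accepted track as a comma-separated list of phrases, with commas acting as a logical OR and spaces inside a phrase as a logical AND. The helper build_track_value and the params dictionary are hypothetical names of ours; the paper does not show its client code, and no request is actually sent here.

```python
# Keywords from Table II, listed row by row.
TRACK_KEYWORDS = [
    "Yogyakarta", "Jogjakarta", "Jogja",
    "Yogya", "Adisutjipto", "Adi Sutjipto",
    "lalinjogja", "RTMC_Jogja", "ATCS_DIY",
    "jogjaupdate", "jogja24jam", "infojogja",
    "yogyakartacity", "jogjamedia", "tribunjogja",
    "unisifmyk", "UGM", "UII",
    "UNY", "UMY", "lalinyk",
]


def build_track_value(keywords):
    """Join keywords into one comma-separated string.

    Blank entries and case-insensitive duplicates are dropped, and the
    original order is preserved. (Hypothetical helper, not from the paper.)
    """
    seen = set()
    cleaned = []
    for keyword in keywords:
        keyword = keyword.strip()
        key = keyword.lower()
        if keyword and key not in seen:
            seen.add(key)
            cleaned.append(keyword)
    return ",".join(cleaned)


# Hypothetical request parameters; "params" is our name, not the paper's.
params = {"track": build_track_value(TRACK_KEYWORDS)}
print(params["track"])
```

A phrase with a space, such as "Adi Sutjipto", would under this reading match tweets containing both words, while the single-word variant "Adisutjipto" covers the joined spelling.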
B. Preprocessing
The preprocessing stage was applied to tweets to remove the parts that were not needed in the next stages [8]. The preprocessing steps in this study were as follows:
a) Removing the "RT" marker. At this step, we used the regular expression "RT\s" to find the appearances of "RT".
b) Converting all letters in a tweet to lowercase.
c) Removing website addresses in the tweet. At this step, we used the regular expression "\shttp.+\s".
d) Removing Twitter usernames. At this step, we used the regular expression "@[a-zA-Z0-9_]+".
e) Removing non-alphanumeric characters (characters other than letters and numbers). At this step, we used the regular expression "[^a-zA-Z0-9]".
f) Changing abbreviations to their actual phrases. We changed abbreviations that frequently appeared in tweets.
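
The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name preprocess, the trailing-space padding, and the replacement of every match with a single space are our choices, and the abbreviation map contains only the two expansions that can be read off the example in Table IV (lalin to lalu lintas, sp to simpang).

```python
import re

# Illustrative abbreviation map. Only these two entries are taken from the
# example in Table IV; the authors' full list is not given in the paper.
ABBREVIATIONS = {"lalin": "lalu lintas", "sp": "simpang"}


def preprocess(tweet: str) -> str:
    """Apply preprocessing steps a) to f) to one tweet and return the cleaned text."""
    # Assumption: pad with a space so a URL at the very end of the tweet is
    # still followed by whitespace, as the paper's URL pattern requires.
    text = tweet + " "
    text = re.sub(r"RT\s", " ", text)            # a) remove the retweet marker
    text = text.lower()                          # b) lowercase everything
    text = re.sub(r"\shttp.+\s", " ", text)      # c) remove website addresses
    text = re.sub(r"@[a-zA-Z0-9_]+", " ", text)  # d) remove Twitter usernames
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)    # e) non-alphanumerics become spaces
    # f) expand known abbreviations word by word; split() also collapses
    # the runs of spaces left behind by the earlier substitutions.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


sample = "09.55 wib lalin seputaran sp condongcatur ramai lancar https://t.co/HRwTeIzlyt"
print(preprocess(sample))
```

With these assumptions the sketch reproduces both rows of Table IV. Note that the URL pattern from step c) is greedy, so it also removes any words that follow a link on the same line; a narrower pattern such as "http\S+" could be substituted if that is not wanted.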
Fig. 1. Flowchart of the proposed method.

A. Tweet Collection from Twitter Streaming API
New tweets were collected from the Twitter Streaming API in real time. We used several Twitter Streaming API parameters, namely the follow and track parameters. The follow parameter was used to get new tweets in real time from several accounts, as shown in Table I. The track parameter was used to get new tweets in real time based on the keywords defined in Table II.

TABLE I
TWITTER USERNAMES AND IDS USED IN FOLLOW PARAMETER
Twitter Username    Twitter User ID
@lalinjogja         250022672
@RTMC_Jogja         187397386
@ATCS_DIY           1118238337
@twit_macet         4675666764
@JogjaUpdate        128175561
@Jogja24Jam         537556372
@infojogja          106780531
@YogyakartaCity     62327666
@JogjaMedia         454564576
@tribunjogja        223476605
@unisifmyk          201720189

C. Feature Extraction
This research used two types of feature extraction. The first used all words in the dataset as features. The second used only a few selected words as features. Features were selected by their frequency of appearance in the dataset. We selected the words that appeared most frequently in the traffic tweets dataset [9]. The steps were as follows:
a) Processing the traffic tweets dataset with the preprocessing steps.
b) Counting word appearances in the dataset.
c) Sorting the words by frequency of appearance.
d) Taking the 50 words that appeared most frequently.
e) Removing unneeded words, such as person names, place names, etc.
f) Removing words that had two letters or fewer.
A dictionary containing the words and their appearance counts in a tweet was used as input to the classifier.

D. Classifier Model Building with Machine Learning Algorithms
Tweets were classified into two categories, namely tweets that were related to traffic (traffic) and tweets that were not related to traffic (non_traffic). This classification was intended to separate tweets about traffic from other tweets. Three machine learning algorithms, namely Naive Bayes (NB),
Support Vector Machine (SVM), and Decision Tree (DT), were used in this research.

Four parameters were used to evaluate the performance [9] of each machine learning algorithm. In the formulas, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

1) Accuracy: Accuracy is the fraction of the classification results that are correct. The formula is
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

2) Precision: Precision is the fraction of the predicted documents in a class that are correct. The formula is
$$\text{Precision} = \frac{TP}{TP + FP}$$

3) Recall: Recall is the fraction of documents in a class that are correctly predicted by the system. The formula is
$$\text{Recall} = \frac{TP}{TP + FN}$$

4) F-score: F-score is a weighted harmonic mean of precision and recall. We use the balanced F-score with the formula
$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

We only measured precision, recall, and F-score for the traffic class. We calculated all evaluation parameters through the ten-fold cross-validation technique.

IV. EXPERIMENTAL RESULTS

A. Tweets Data Acquisition
In the tweet data acquisition stage, we collected 110,449 tweets in total. These data were used in building the classification model. The 110,449 tweets consisted of 17,592 tweets in the traffic class and 92,857 tweets in the non_traffic class.

B. Tweet Collection from Twitter Streaming API

TABLE III
TWEETS FROM TWITTER STREAMING API
Date and Time          Tweet Text
2016-03-15 11:31:36    UN 2016 : Tryout di SMA Muhammadiyah 3 Jogja Diikuti Ribuan Peserta https://t.co/4bMhW4xoow https://t.co/zwpfMm8A57
2016-04-30 09:55:44    09.55 wib lalin seputaran sp condongcatur ramai lancar https://t.co/HRwTeIzlyt

The Twitter Streaming API was a real-time data source for our system. With the Twitter Streaming API, Twitter sent tweet objects in JavaScript Object Notation (JSON) form once there was a tweet that matched our follow and track parameters. There were many variables in a tweet JSON object, but we only used the created_at and text variables. Table III shows examples of tweets received by our system.

C. Preprocessing
The preprocessing stage was used to prepare the tweet text before it was processed in the next stages. The preprocessing steps were explained in the previous section. An example of the preprocessing result is shown in Table IV.

TABLE IV
PREPROCESSING RESULT
Original tweet:       UN 2016 : Tryout di SMA Muhammadiyah 3 Jogja Diikuti Ribuan Peserta https://t.co/4bMhW4xoow https://t.co/zwpfMm8A57
After preprocessing:  un 2016 tryout di sma muhammadiyah 3 jogja diikuti ribuan peserta

Original tweet:       09.55 wib lalin seputaran sp condongcatur ramai lancar https://t.co/HRwTeIzlyt
After preprocessing:  09 55 wib lalu lintas seputaran simpang condongcatur ramai lancar

D. Feature Extraction
The feature extraction process counted the occurrences of all the words in a tweet as features. The dictionary that contained the words and their occurrences was used to train the classifier. A dictionary was a set of data in key-value form. The keys and the values were the words and their occurrences, respectively. Feature extraction using only a few selected words was preceded by the feature selection process. In the feature selection process, we obtained 40 words as features, as shown in Table V.

TABLE V
LIST OF FEATURES
Bahasa Indonesia  English                           Bahasa Indonesia  English
antrian           queue                             maupun            although
arah              direction                         mengarah          directing
arus              flow                              menuju            heading
atau              or                                pada              on
barat             west                              padat             congested
cerah             sunny                             patuhi            obey
cuaca             weather                           pukul             o'clock
dalam             in                                ramai             crowded
dan               and                               rambu             sign
dari              from                              sebaliknya        opposite
informasikan      inform                            selatan           south
jalan             road/street                       semua             all
kaki              foot                              seputaran         around
kami              we                                simpang           intersection
kendaraan         vehicle                           situasi           situation
kondisi           condition                         terpantau         observed
kota              city                              tetap             still
lalu              part of "lalu lintas" (traffic)   timur             east
lancar            smooth                            utara             north
lintas            part of "lalu lintas" (traffic)   wib               Western Indonesian Time

E. Development of Classifier Model using Machine Learning Algorithms
The data used in this research amounted to 110,449 tweets, consisting of 17,592 traffic tweets and 92,857 non_traffic tweets. The dataset was imbalanced between classes. Thus, the evaluation of the classification model was conducted for both the imbalanced dataset and a balanced dataset. The evaluation for the imbalanced dataset used all 110,449 tweets. The evaluation for the balanced dataset used 35,184 tweets, consisting of 17,592 traffic tweets and 17,592 non_traffic
tweets. The 17,592 non_traffic tweets were selected randomly from the 92,857 non_traffic tweets.

TABLE VI
EVALUATION MEASUREMENT OF BALANCED DATASET (35,184 TWEETS)
Feature           Model   Accuracy   Precision   Recall   F-score
All words         NB      99.37%     99.10%      99.62%   99.36%
All words         SVM     99.77%     99.65%      99.89%   99.77%
All words         DT      99.48%     99.44%      99.52%   99.48%
Selected words    NB      98.02%     96.32%      99.71%   97.99%
Selected words    SVM     98.31%     97.23%      99.37%   98.29%
Selected words    DT      98.41%     97.52%      99.28%   98.39%

Table VI shows that for the balanced dataset with 35,184 tweets and all words as features, SVM yielded the best performance in all measurements (99.77% accuracy and F-score), highlighted in yellow in the original table. However, when using only selected words as features, DT yielded the best accuracy, precision, and F-score.

TABLE VII
EVALUATION MEASUREMENT OF IMBALANCED DATASET (110,449 TWEETS)
Feature           Model   Accuracy   Precision   Recall   F-score
All words         NB      99.76%     98.94%      99.52%   99.23%
All words         SVM     99.87%     99.41%      99.80%   99.60%
All words         DT      99.70%     98.76%      99.34%   99.05%
Selected words    NB      99.23%     95.74%      99.39%   97.53%
Selected words    SVM     99.23%     96.43%      98.68%   97.54%
Selected words    DT      99.42%     96.75%      99.57%   98.14%

For the imbalanced dataset with 110,449 tweets and all words as features, SVM yielded the best performance in all measurements, as shown in Table VII, while DT produced the best accuracy with only selected words as features. From Table VI and Table VII, we can see that the amount of data affected classification performance. More data produced better classification accuracy. Furthermore, the improvement depends on the implemented algorithm. For instance, SVM was found to be quite sensitive to imbalanced datasets [10]. The results with all words as features are consistent with SVM not needing feature selection to improve accuracy [11]. On the contrary, feature selection affected DT performance, since too many and too specific features produce unneeded tree branches that cause overfitting [12].

TABLE VIII
TRAINING TIME OF MODELS
Feature           Model   Training time, 35,184 tweets (seconds)   Training time, 110,449 tweets (seconds)
All words         NB      1.068                                    2.129
All words         SVM     1.510                                    4.011
All words         DT      3.660                                    18.560
Selected words    NB      1.335                                    4.642
Selected words    SVM     2.332                                    7.483
Selected words    DT      1.372                                    4.793

We evaluated training time as another aspect of the classification model. The training time displayed in Table VIII is the average training time over the ten folds of cross-validation. As shown in Table VIII, NB has the fastest training time because of its simple model building. Moreover, DT training time was greatly influenced by the amount of data and the number of features.

After evaluating the three machine learning algorithms, the imbalanced dataset with 110,449 tweets and feature extraction using all words as features were used in the application. This dataset was selected because it produced better accuracy than the balanced dataset with 35,184 tweets. Moreover, the real data fetched from the Twitter Streaming API are highly imbalanced between the traffic and non_traffic categories. Extraction of all words as features was used because it produced better accuracy than using only selected words as features.

V. CONCLUSION
This research aims to develop real-time classification of traffic tweets for Yogyakarta Province (Indonesia). We evaluated three machine learning algorithms to find the best algorithm for classifying tweet data in real time. For both the imbalanced and balanced datasets, the Support Vector Machine (SVM) algorithm produced the best performance using all words as features, while the Decision Tree (DT) algorithm yielded the best performance using only selected words as features. Experimental results show that the SVM algorithm produced the best performance among the algorithms, with 99.77% and 99.87% classification accuracy on balanced and imbalanced data, respectively. We also found that the feature selection algorithm used in this research did not improve accuracy. Furthermore, feature selection and the amount of data affected the performance of the classification model. Further research is needed to investigate an appropriate approach for better classification regardless of the amount of mined Twitter data.

REFERENCES
[1] T. Sakaki, M. Okazaki, and Y. Matsuo, "Earthquake shakes Twitter users: real-time event detection by social sensors," in Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 851-860.
[2] E. D'Andrea, P. Ducange, B. Lazzerini, and F. Marcelloni, "Real-Time Detection of Traffic From Twitter Stream Analysis," IEEE Trans. Intell. Transp. Syst., vol. 16, no. 4, pp. 2269-2283, Aug. 2015.
[3] T. Sakaki, Y. Matsuo, T. Yanagihara, N. P. Chandrasiri, and K. Nawa, "Real-time event extraction for driving information from social sensors," in 2012 IEEE International Conference on Cyber Technology in Automation, Control, and Intelligent Systems (CYBER), 2012, pp. 221-226.
[4] Y. Gu, Z. (Sean) Qian, and F. Chen, "From Twitter to detector: Real-time traffic incident detection using social media data," Transp. Res. Part C Emerg. Technol., vol. 67, pp. 321-342, Jun. 2016.
[5] C. Gutiérrez, P. Figuerias, P. Oliveira, R. Costa, and R. Jardim-Goncalves, "Twitter mining for traffic events detection," in Science and Information Conference (SAI), 2015, pp. 371-378.
[6] A. Wibisono, I. Sina, M. A. Ihsannuddin, A. Hafizh, B. Hardjono, A. Nurhadiyatna, W. Jatmiko, and P. Mursanto, "Traffic intelligent system architecture based on social media information," in 2012 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2012, pp. 25-30.
[7] S. F. Rodiyansyah and E. Winarko, "Klasifikasi Posting Twitter Kemacetan Lalu Lintas Kota Bandung Menggunakan Naive Bayesian Classification," IJCCS-Indones. J. Comput. Cybern. Syst., vol. 7, no. 1, pp. 13-22, 2013.
[8] N. Monarizqa, L. E. Nugroho, and B. S. Hantono, "Penerapan Analisis Sentimen pada Twitter Berbahasa Indonesia sebagai Pemberi Rating," Universitas Gadjah Mada, Perpustakaan Pusat UGM, 2014.
[9] C. D. Manning, P. Raghavan, H. Schütze, and others, Introduction to Information Retrieval, vol. 1. Cambridge: Cambridge University Press, 2008.
[10] R. Batuwita and V. Palade, "Class Imbalance Learning Methods for Support Vector Machines," in Imbalanced Learning: Foundations, Algorithms, and Applications, Haibo He and Yunqian Ma (Eds.), Wiley, (book chapter), 2013.
[11] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," in European Conference on Machine Learning (ECML), Berlin, 1998, pp. 137-142.
[12] R. Garreta and G. Moncecchi, Learning scikit-learn: machine learning in Python: experience the benefits of machine learning techniques by applying them to real-world problems using Python and the open source scikit-learn library. 2013.
