I. INTRODUCTION
Metsis et al. [9] exploited a language-independent, content-based feature called term frequency (TF). Using this feature with different variants of the Naive Bayes algorithm, they found results that are better than those found by a number of header-based methods. Likewise, studies continue to report improved results with email-text features (see Prabhakar and Basavaraju [10], and Ma et al. [11]), most of which perform better than header features.
The proposed method utilizes text features that are long-established, such as the frequency of spam words and HTML tags, as well as some that are new. The novelty of our work is that we introduce language-centric features such as grammar and spelling errors, use of function words, presence of verbs and alpha-numerics, TF·IDF, and inverse sentence frequency. In addition, we use features related to message readability (e.g., reading difficulty indexes, complex and simple word frequency, document length, and word length). The features are extracted from four standard email datasets: CSDMC2010, SpamAssassin, LingSpam, and Enron-Spam. The features are then used by five well-known learning algorithms, Random Forest (RF), BAGGING, ADABOOSTM1, Support Vector Machine (SVM), and Naive Bayes (NB), to induce binary classifiers. From our extensive experiments, we find that the addition of text and readability features significantly improves classifier performance compared to that found with the commonplace features alone. Another notable finding is that the classifier induced by BAGGING performs the best on all of the datasets. In addition, its performance surpasses that of a number of state-of-the-art methods proposed in previous studies. Like those of Metsis et al. [9], our features are language independent. Although we derive the results solely from English emails, the proposed approach may be an excellent means to classify spam emails in any language.
The next section details the features used in this research.
Following that, we describe the learning algorithms, performance evaluation measures, and datasets. In Section VI, we
report the experimental findings, and in Section VII we provide
some discussion of these results, especially as compared with
previous filtering methods. Finally, Section VIII concludes the
paper.
II. FEATURE SELECTION
The feature set, organized into three groups:

Groups                 Features
Traditional Features   Spam Words; HTML Anchors; HTML Non-anchors; HTML Tags
Text Features          Alpha-numeric Words; Verbs; Function Words; Grammar Errors; Spelling Errors; Language Errors; TF·IDF; TF·ISF
Readability Features   FI; FKRI; FRES; SMOG; FORCAST; Simple Words; Complex Words; TF·IDF simple; TF·IDF complex; Simple Word FI; Inverse FI; Document Length; Word Length
A. Traditional Features
Below are descriptions of the four typical features for spam
classification that are used in our experiment.
1) Dictionary-based Features: The first feature in this group is the frequency of spam words. The selection of this feature is inspired by the interesting findings of Graham [12]. He showed that merely looking for the word "click" in the messages can detect 79.7% of spam emails in a dataset, with only 1.2% false positives. To exploit this feature, we have developed a dictionary comprising 381 spam words1 accumulated from various anti-spamming blogs. We treated each message as a bag-of-words and counted the frequency of spam words in it as follows:

X1: Spam Words = #Spam Words.
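As an illustration of how this count can be computed, here is a minimal Java sketch; the dictionary file name and the tokenization rule are our own assumptions, not details taken from the paper:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashSet;
    import java.util.Set;

    public class SpamWordCounter {
        private final Set<String> spamWords = new HashSet<>();

        // The file name is a placeholder; the 381-word list itself is the one
        // referenced in footnote 1, assumed to be stored one word per line.
        public SpamWordCounter(Path dictionaryFile) throws Exception {
            for (String word : Files.readAllLines(dictionaryFile)) {
                spamWords.add(word.trim().toLowerCase());
            }
        }

        // X1: treat the message as a bag-of-words and count the spam words in it.
        public int countSpamWords(String message) {
            int count = 0;
            for (String token : message.toLowerCase().split("\\W+")) {
                if (spamWords.contains(token)) count++;
            }
            return count;
        }
    }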
2) HTML Features: The frequency of HTML tags in emails is a common means to classify spam. We have subdivided this feature into three features: (i) the frequency of anchor tags (i.e., the number of close-ended tags <a> and </a>), (ii) the frequency of tags that are not anchors (e.g., <p> or <br>), and (iii) the total number of HTML tags in the email (i.e., the sum of (i) and (ii)). To identify HTML tags in the emails, we used a Java HTML parser called jsoup2. Each feature is normalized by the length of the email, N, which is the number of sentences in the message. The detailed HTML feature calculations are:

X2: anchor = #anchor tags / N, and

X3: non-anchor = #non-anchor tags / N.
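For illustration, both tag counts can be obtained with jsoup roughly as follows; this sketch assumes the HTML body has already been extracted from the email and that N, the sentence count, is computed elsewhere:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class HtmlTagFeatures {
        // Returns {X2, X3}: anchor and non-anchor tag counts, each
        // normalized by n, the number of sentences in the message.
        public static double[] tagFeatures(String htmlBody, int n) {
            Document doc = Jsoup.parse(htmlBody);
            // getAllElements() includes the <body> element itself, so subtract 1.
            int allTags = doc.body().getAllElements().size() - 1;
            int anchors = doc.select("a").size();
            int nonAnchors = allTags - anchors;
            return new double[] { (double) anchors / n, (double) nonAnchors / n };
        }
    }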
The frequency of a term t in a message is its TF_t, while its inverse sentence frequency, ISF_t, is:

ISF_t = log(N / SF_t),

where SF_t is the number of sentences in which t appears (by analogy with the inverse document frequency, IDF_t, which uses the document frequency DF_t in place of SF_t).
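Assuming the usual product form TF_t × ISF_t for the feature value of a term (an assumption; the surrounding text is truncated here), a small Java sketch of the computation is:

    import java.util.Collections;
    import java.util.List;

    public class TfIsf {
        // sentences: the message split into sentences, each a list of tokens.
        // Sentence splitting and tokenization are assumed to happen upstream.
        public static double tfIsf(String term, List<List<String>> sentences) {
            int tf = 0; // frequency of the term in the whole message
            int sf = 0; // number of sentences containing the term
            for (List<String> sentence : sentences) {
                int inSentence = Collections.frequency(sentence, term);
                tf += inSentence;
                if (inSentence > 0) sf++;
            }
            if (sf == 0) return 0.0; // absent term: feature value is 0
            double isf = Math.log((double) sentences.size() / sf);
            return tf * isf;
        }
    }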
1 http://cogenglab.csd.uwo.ca/sentinel/spam-term-list.html
2 http://jsoup.org/download
3 http://nlp.stanford.edu/software/tagger.shtml
4 http://cogenglab.csd.uwo.ca/sentinel/function-word-list.html
5 http://www.languagetool.org/java-api/
1) Score-based Features: The first score-based readability feature is the Fog Index (FI):

X13: FI = 0.4 × ((#Words / N) + 100 × (#Complex Words / #Words)).

Second, we used the Flesch Reading Ease Score (FRES) [15], one of the oldest readability scores. The FRES of any given message can be found as follows:

X14: FRES = 206.835 − 1.015 × (#Words / N) − 84.6 × (#Syllables / #Words).

The SMOG index, when first published, was anticipated as a proper substitute for FI due to its accuracy and ease of use [16]. It is the third readability score used as a feature:

X15: SMOG Index = 1.043 × sqrt(30 × (#Complex Words / N)) + 3.1291.

The FORCAST index was originally formulated to assess the reading skills required by different military jobs without focusing on running narratives [17]. The index, unlike the others, emphasizes the frequency of simple words. The FORCAST index of a message is calculated as:

X16: FORCAST Index = 20 − (W / 10),

where W is the number of simple words in a 150-word sample of the text.

A second instalment of the readability index proposed by Flesch, further investigated and modified by Kincaid [18], is known as the Flesch-Kincaid Readability Index (FKRI). This feature can be calculated as follows:

X17: FKRI = 0.39 × (#Words / N) + 11.8 × (#Syllables / #Words) − 15.59.

In addition, we have modified FI in two different ways to give two additional features. We chose to modify FI only, because empirical outcomes showed that these modifications of FI provide better results than those of the other scores.

First, in X13, in the place of complex words, we substituted the frequency of simple words:

X18: FI_simple = 0.4 × ((#Words / N) + 100 × (#Simple Words / #Words)).

Second, we took the inverse of FI:

X19: Inverse FI = 1 / FI.

2) Frequency-based Features: We also considered the frequencies of Complex Words and Simple Words in the messages as two more features:

X20: Complex Words = #Complex Words, and

X21: Simple Words = #Simple Words.
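Once the word, sentence, syllable, and simple/complex word counts of a message are available, all of these scores reduce to simple arithmetic. The following Java sketch restates the formulas above; it is an illustration, not the paper's implementation, and the counting itself is assumed to happen upstream:

    public class ReadabilityScores {
        // n is the number of sentences in the message; the other arguments
        // are raw counts for that message.

        public static double fi(int words, int complexWords, int n) {        // X13
            return 0.4 * ((double) words / n + 100.0 * complexWords / words);
        }

        public static double fres(int words, int syllables, int n) {         // X14
            return 206.835 - 1.015 * words / (double) n
                           - 84.6 * syllables / (double) words;
        }

        public static double smog(int complexWords, int n) {                 // X15
            return 1.043 * Math.sqrt(30.0 * complexWords / n) + 3.1291;
        }

        public static double forcast(int simpleWordsIn150WordSample) {       // X16
            return 20.0 - simpleWordsIn150WordSample / 10.0;
        }

        public static double fkri(int words, int syllables, int n) {         // X17
            return 0.39 * words / (double) n
                 + 11.8 * syllables / (double) words - 15.59;
        }

        public static double fiSimple(int words, int simpleWords, int n) {   // X18
            return 0.4 * ((double) words / n + 100.0 * simpleWords / words);
        }

        public static double inverseFi(int words, int complexWords, int n) { // X19
            return 1.0 / fi(words, complexWords, n);
        }
    }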
Parameters of the learning algorithms:

Learning Algorithms   Parameters
Random Forest (RF)    Maximum Depth: Unlimited; Number of Trees to be Generated: 10; Random Seed: 1
ADABOOSTM1            Number of Iterations: 10; Random Seed: 1; Resampling: False; Weight Threshold: 100
BAGGING               Size of Bag (%): 100; Out of Bag Error: False; Number of Iterations: 10; Random Seed: 1
SVM                   SVM Type: C-SVC; Cost: 1.0; Degree of Kernel: 3; EPS: 0.0010; Gamma: 0.0; Kernel Type: Radial Basis; Epsilon: 0.1; Probability Estimates: False; Shrinking Heuristics: True
NB                    Use of Kernel Estimator: False
III. LEARNING ALGORITHMS

The ADABOOSTM1 and BAGGING ensembles use RF as their base learner; we refer to the resulting classifiers as BOOSTED RF and BAGGED RF, respectively.
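The parameter names in the table above match the defaults of the Weka toolkit, which suggests, though the excerpt does not state, a Weka-based setup. Under that assumption, the BAGGED RF configuration could be reproduced along these lines; the ARFF file name is a placeholder:

    import weka.classifiers.Evaluation;
    import weka.classifiers.meta.Bagging;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BaggedRfExample {
        public static void main(String[] args) throws Exception {
            // Extracted feature vectors in ARFF format (illustrative file name).
            Instances data = new DataSource("email-features.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1); // last attribute: spam/ham

            RandomForest rf = new RandomForest(); // base learner, unlimited depth by default
            rf.setSeed(1);

            Bagging baggedRf = new Bagging();     // BAGGED RF
            baggedRf.setClassifier(rf);
            baggedRf.setNumIterations(10);
            baggedRf.setBagSizePercent(100);
            baggedRf.setSeed(1);

            // 10-fold cross-validation, as a generic evaluation harness.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(baggedRf, data, 10, new java.util.Random(1));
            System.out.println(eval.toSummaryString());
        }
    }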
IV. EVALUATION MEASURES
Confusion matrix for spam classification:

                 Prediction
                 Spam    Ham
Actual   Spam    nss     nsh
         Ham     nhs     nhh
Summary of the datasets:

Dataset        Total Messages   Spam Rate   Text Pre-processed?   Year of Curation
CSDMC2010      4,327            31.85%      No                    2010
SpamAssassin   6,046            31.36%      No                    2002
LingSpam       2,893            16.63%      Yes                   2000
Enron-1        5,172            29.00%      Yes                   2006
Enron-2        5,857            25.54%      Yes                   2006
Enron-3        5,512            27.21%      Yes                   2006
Enron-4        6,000            75.00%      Yes                   2006
Enron-5        5,175            71.01%      Yes                   2006
Enron-6        6,000            75.00%      Yes                   2006
The evaluation measures are computed from the confusion matrix. Precision is nss / (nss + nhs), recall is nss / (nss + nsh), and accuracy is (nhh + nss) / (nhh + nhs + nsh + nss). The false positive rate (FPR) is the fraction of hams misclassified as spam:

FPR = nhs / (nhs + nhh).

In contrast, the false negative rate (FNR) is the fraction of spams that reach the user inbox:

FNR = nsh / (nsh + nss).

Note that the lower the FPR and FNR, the better the performance. Considering the situation where users might accept spam entering their inbox but prefer that their hams not end up in the spam trap, a higher FPR is more expensive than a higher FNR. To resolve this, we need a balance between FPR and FNR, which is captured here by the AUC. The AUC is measured using the Receiver Operating Characteristic (ROC) curve. The ROC curve is a 2-D graph whose Y-axis is the true positive rate (which is 1 − FNR) and whose X-axis is the FPR; it therefore depicts the trade-off between the cost of nsh and that of nhs.
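For concreteness, a minimal Java helper that derives these measures from the four confusion-matrix counts (the F-score reported later is assumed to be the balanced F1, the harmonic mean of precision and recall):

    public class SpamMetrics {
        // nss: spam predicted as spam,  nsh: spam predicted as ham,
        // nhs: ham predicted as spam,   nhh: ham predicted as ham.
        public static void report(int nss, int nsh, int nhs, int nhh) {
            double precision = (double) nss / (nss + nhs);
            double recall    = (double) nss / (nss + nsh);
            double accuracy  = (double) (nhh + nss) / (nhh + nhs + nsh + nss);
            double fpr       = (double) nhs / (nhs + nhh); // hams sent to the spam trap
            double fnr       = (double) nsh / (nsh + nss); // spams reaching the inbox
            double f1        = 2.0 * precision * recall / (precision + recall);
            System.out.printf("P=%.3f R=%.3f Acc=%.3f FPR=%.3f FNR=%.3f F1=%.3f%n",
                    precision, recall, accuracy, fpr, fnr, f1);
        }
    }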
V. DATASETS
[Figure: Feature importance plots. X-axis: Features (X1–X25, plus the randMin, randMean, and randMax reference features); Y-axis: Importance. (a) The order of the features according to their importance for the CSDMC2010 dataset. (b) The order of the features according to their importance for the SpamAssassin dataset. (c) The order of the features according to their importance for the LingSpam dataset.]
VI. EXPERIMENTAL RESULTS
A. Feature Importance
In machine learning, the identification of features relevant for classification is important, as is the observation of results when features work in groups.
              Baseline   Baseline + Text   Baseline + Readability   All
                         Features          Features
RF            0.060      0.034             0.038                    0.040
BOOSTED RF    0.062      0.028             0.028                    0.030
BAGGED RF     0.055      0.023             0.023                    0.020
SVM           0.022      0.037             0.023                    0.027
NB            0.031      0.065             0.093                    0.101

(a) FPR of different groups of features combined with the baseline for the CSDMC2010 dataset.
              Baseline   Baseline + Text   Baseline + Readability   All
                         Features          Features
RF            0.064      0.040             0.045                    0.034
BOOSTED RF    0.071      0.029             0.034                    0.026
BAGGED RF     0.061      0.026             0.028                    0.022
SVM           0.055      0.063             0.050                    0.052
NB            0.060      0.075             0.094                    0.104

(b) FPR of different groups of features combined with the baseline for the SpamAssassin dataset.
              Baseline   Baseline + Text   Baseline + Readability   All
                         Features          Features
RF            0.041      0.019             0.023                    0.017
BOOSTED RF    0.044      0.018             0.016                    0.017
BAGGED RF     0.040      0.011             0.015                    0.009
SVM           0.0004     0.012             0.020                    0.014
NB            0.063      0.091             0.214                    0.218

(c) FPR of different groups of features combined with the baseline for the LingSpam dataset.
              Baseline   Baseline + Text   Baseline + Readability   All
                         Features          Features
RF            0.404      0.164             0.245                    0.174
BOOSTED RF    0.404      0.156             0.240                    0.158
BAGGED RF     0.402      0.143             0.218                    0.150
SVM           0.424      0.371             0.442                    0.416
NB            0.408      0.304             0.376                    0.336

(d) FPR of different groups of features combined with the baseline for the Enron-Spam dataset.
TABLE V: The effect of incrementing groups of features on spam classification for the (a) CSDMC2010, (b) SpamAssassin, (c) LingSpam, and (d) Enron-Spam datasets. The FPRs that are not statistically significantly different from the comparable baseline are written in italics.
              FPR     FNR     Accuracy %   Precision   Recall   F-score   AUC
RF            0.040   0.092   94.338       0.914       0.908    0.911     0.980
BOOSTED RF    0.030   0.089   95.124       0.934       0.912    0.922     0.980
BAGGED RF     0.020   0.107   95.193       0.953       0.893    0.922     0.988
SVM           0.027   0.390   85.718       0.913       0.610    0.730     0.792
NB            0.101   0.396   80.471       0.737       0.604    0.662     0.855

(a) Evaluation measures of the full-featured spam classifiers for the CSDMC2010 dataset.
              FPR     FNR     Accuracy %   Precision   Recall   F-score   AUC
RF            0.034   0.093   94.707       0.923       0.907    0.915     0.979
BOOSTED RF    0.026   0.079   95.700       0.941       0.921    0.931     0.982
BAGGED RF     0.022   0.099   95.353       0.948       0.901    0.924     0.986
SVM           0.052   0.292   87.265       0.861       0.708    0.777     0.828
NB            0.104   0.558   75.373       0.660       0.443    0.529     0.847

(b) Evaluation measures of the full-featured spam classifiers for the SpamAssassin dataset.
              FPR     FNR     Accuracy %   Precision   Recall   F-score   AUC
RF            0.017   0.162   95.817       0.907       0.838    0.869     0.978
BOOSTED RF    0.017   0.162   95.886       0.910       0.838    0.871     0.977
BAGGED RF     0.009   0.193   95.956       0.944       0.807    0.868     0.986
SVM           0.014   0.341   93.156       0.907       0.659    0.760     0.822
NB            0.218   0.277   77.186       0.402       0.723    0.515     0.831

(c) Evaluation measures of the full-featured spam classifiers for the LingSpam dataset.
              FPR     FNR     Accuracy %   Precision   Recall   F-score   AUC
RF            0.174   0.072   91.681       0.887       0.927    0.906     0.960
BOOSTED RF    0.158   0.070   92.288       0.896       0.929    0.912     0.956
BAGGED RF     0.150   0.080   92.521       0.910       0.919    0.914     0.972
SVM           0.416   0.379   78.350       0.836       0.620    0.627     0.602
NB            0.336   0.302   73.344       0.680       0.697    0.686     0.750

(d) Average evaluation measures of the full-featured spam classifiers for the Enron-Spam dataset.
TABLE VI: Performances of the classifiers on (a) CSDMC2010, (b) SpamAssassin, (c) LingSpam, and (d) Enron-Spam Datasets.
              FPR     FNR     Accuracy %   Precision   Recall   F-score   AUC
RF            0.187   0.320   71.411       0.796       0.714    0.731     0.830
BOOSTED RF    0.159   0.329   71.510       0.805       0.712    0.732     0.843
BAGGED RF     0.125   0.369   69.473       0.809       0.695    0.713     0.855
SVM           0.059   0.578   55.772       0.799       0.558    0.570     0.681
NB            0.176   0.634   48.529       0.712       0.485    0.497     0.645

(a) Classifiers trained on ham-skewed Enron 1-3 and tested on spam-skewed Enron 4-6.
              FPR     FNR     Accuracy %   Precision   Recall   F-score   AUC
RF            0.458   0.077   64.494       0.808       0.645    0.661     0.865
BOOSTED RF    0.445   0.069   65.709       0.815       0.657    0.673     0.887
BAGGED RF     0.379   0.061   70.793       0.834       0.708    0.723     0.908
SVM           0.842   0.028   37.833       0.763       0.378    0.321     0.564
NB            0.734   0.089   44.048       0.732       0.440    0.425     0.717

(b) Classifiers trained on spam-skewed Enron 4-6 and tested on ham-skewed Enron 1-3.
VIII. CONCLUSIONS
REFERENCES

[22] M. Ye, T. Tao, F.-J. Mai, and X.-H. Cheng, "A spam discrimination based on mail header feature and SVM," in Fourth International Conference on Wireless Communications, Networking and Mobile Computing (WiCom '08), 2008, pp. 1-4.