SYS 4021
Donald E. Brown
Laura Barnes
brown@virginia.edu
lb3dp@virginia.edu
Summary
In this study, I use a logistic regression model to build a static filter design, i.e. to classify e-mails as spam or ham. I find that there are 3 important variables to consider in filtering out spam: the frequency of certain words/characters in the message, the longest run-length of capital letters, and the total run-length of capital letters. These variables are highly significant in the logistic regression model at the 5% level. It is important to transform the predictors to a log scale, as this increases the accuracy of the model. The final model selected for spam filtering has the highest accuracy, with the smallest total error (13.4%) and false positive rate (7.9%), and it also fits better based on the BIC criterion. For spam filtering, I also build a time series filter design to predict the daily amount of spam e-mails. I find that there is a relationship between the amount of spam e-mails received and the time of arrival; time is highly significant in the linear regression model at the 5% level. For the spam data, the residuals can be modeled by an ARMA model with 2 autoregressive (AR) terms and 1 moving average (MA) term, and this model gives the best forecast. For the ham data, the residuals can be modeled by ARIMA(1,1,1), and this model gives the best forecast with MSE 2.0; the ARIMA(1,1,1) model also has the lowest BIC value. Both models show adequacy in the Ljung-Box Q-statistic plot, since all the points are insignificant. The static and time series filter designs can be integrated into an overall filter design using Bayes rule: for any e-mail that comes into my classifier, the probability that it is spam is determined by the probability that it is spam according to the static filter and the probability that it is spam according to the time series filter.
Honor pledge: On my honor, I pledge that I am the sole author of this paper and I have accurately cited all
help and references used in its completion.
Imran A. Khan
November 10, 2013
1. Problem Description
1.1. Situation
E-mail is the most affordable, easiest, and fastest means of communication. For many e-mail users around the world, unsolicited junk mail, or spam, is one of the biggest problems facing the internet today. Spam often contains annoying advertisements for services or products, adult content, or other inappropriate material. It wastes people's time and internet bandwidth, and its damage has developed into a serious technical, economic, and social threat. The cost of spam in terms of lost productivity is estimated to be about $21.58 billion annually, according to the 2004 National Technology Readiness Survey [7]. The survey reveals that internet users in the US spend about 3 minutes deleting spam e-mails every day. This comes to 22.9 million hours a week, or $20.58 billion based on an average wage across 169.4 million online adults in the US.
Spammers are dumping a lot on society and reaping fairly little in return, as noted by Rao and Reiley in their report The Economics of Spam [6]. With 94 billion spam messages sent daily, spam costs society around $20 billion while generating only about $200 million in revenue. Although the revenue from spamming is very low, it is extremely cheap for spammers to send massive amounts of e-mail. Rao and Reiley estimate that spammers need only 1 in 25,000 recipients to buy something to turn a good profit. Comparing the cost of spam to other advertising media, as shown in Table 1, the cost of spam mailing (per thousand impressions) is the lowest. It is also very easy for spammers to break even with a marginal profit of $50.
Table 1. Cost of spam advertising relative to other advertising media (cost per thousand impressions)

Advertising vector            CPM            Percent deliveries    Breakeven conversion per 100,000
                                                                   with marginal profit = $50.00
Postal direct mail            $250-$1,000    2-10%                 2000
Super Bowl advertising        $20            0.04%                 40
Online display advertising    $1-$5          0.002-0.006%          -
Retail spam                   $0.10-$0.50    0.001-0.002%          0.3
                              $0.03          0.00%                 0.06
                              $0.05          0.00%                 0.1
The number of spam e-mails sent every 24 hours has increased from year to year as the number of e-mail users has grown, causing more lost productivity every year. As shown in Figure 1, about 8% of all e-mails were identified as spam in 2001, and the rate grew by only about 1% in 2002. In 2003, however, the rate jumped to 40% of scanned e-mails, and it kept increasing until 2010. The spam rate then dropped significantly in the last two years. According to the Symantec Intelligence Report issued in February 2012 [4], global spam levels continued to fall and now account for 68% of global e-mail traffic. Nonetheless, spam remains a major problem for many companies and individual e-mail users. Therefore, it is very important to classify e-mails into spam and non-spam so that the loss of productivity can be reduced.
Figure 1. Spam rate over time
[Bar chart of the annual spam rate, 2001-2012: 8.0% (2001) and 9.0% (2002), jumping to 40.0% in 2003, rising through 68.6%, 72.3%, 81.2%, 84.6%, 86.2%, and 87.7% to a peak of 89.1% in 2010, then dropping to 75.1% in 2011 and 68.0% in 2012.]
Many studies have been conducted to deal with spam. One way to filter out spam messages is to identify word patterns, word frequencies, or certain symbols in the message. Based on an e-mail database collected at Hewlett-Packard Labs [1], [2], [5], useful indicator words and symbols include our, free, you, your, money, and $. As shown in Table 2, the frequency of these words makes a big difference in differentiating between wanted and unwanted e-mails, and they are much more likely to appear in spam messages.
Table 2. Average frequency of words/symbols appearing in spam and ham e-mails

Word/symbol    Spam    Ham
our            0.51    0.18
free           0.52    0.07
you            2.26    1.27
your           1.38    0.44
money          0.21    0.02
$              0.17    0.01
The average, longest, and total run-lengths of capital letters in a message can also be used to filter out spam. These 3 indicators show that spam tends to use more capital letters in the message (Table 3).

Table 3. The average, longest, and total run-lengths of capital letters in a message

Run-length    Spam     Ham
Average       9.5      2.4
Longest       104.4    18.2
Total         470.6    161.5
1.2. Goal
The purpose of this study is to build a static filter that determines whether a given e-mail is spam based on the frequency of certain words and characters and on the run-lengths of capital letters. It is also of interest to build a time series filter to detect e-mail spam.
1.3. Metrics
The spam variable, a categorical variable indicating whether a mail is ham or spam, is used as the response variable to build the static filter by fitting logistic regression models. I utilize the AIC and BIC criteria to compare the performance of the fitted models. I use a decision threshold of 0.5 to evaluate the different models by the errors made by each model's decision function. In addition, ROC curves are plotted to evaluate the ability of the logistic models to distinguish between spam and ham e-mails.
A count variable indicating the amount of spam and ham e-mails received per day is used to build the time series filter. I utilize the AIC and BIC criteria to assess the fitted time series models and use the MSE to identify the model with the best forecast.
For both the static and time series analyses, I divide the data into two parts: a training set used for model building and a test set used to evaluate model performance. I use a significance level of 5% throughout this study. If the p-value is less than 0.05, the null hypothesis is rejected in favor of the alternative; if the p-value is greater than 0.05, the null hypothesis is not rejected.
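The threshold-based scoring described above can be made concrete. This is an illustrative pure-Python sketch (the report's actual analysis uses R, and the probabilities and labels below are made-up values):

```python
def score_table(probs, labels, threshold=0.5):
    """Confusion counts and error rates for a decision threshold applied to
    predicted spam probabilities (label 1 = spam, 0 = ham)."""
    tp = sum(p >= threshold and y == 1 for p, y in zip(probs, labels))
    fp = sum(p >= threshold and y == 0 for p, y in zip(probs, labels))
    tn = sum(p < threshold and y == 0 for p, y in zip(probs, labels))
    fn = sum(p < threshold and y == 1 for p, y in zip(probs, labels))
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "total_error": (fp + fn) / len(labels),
        "false_positive_rate": fp / (fp + tn),  # ham wrongly flagged as spam
        "false_negative_rate": fn / (fn + tp),  # spam that slips through
    }

probs = [0.9, 0.2, 0.7, 0.4, 0.6]  # hypothetical model outputs
labels = [1, 0, 1, 1, 0]
print(score_table(probs, labels))
```

The false positive rate is tracked separately because, in spam filtering, discarding a legitimate e-mail is a more serious mistake than letting a spam message through.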
1.4. Hypotheses
I have two hypotheses in this study: one for static filter and one for time series filter.
Hypothesis 1:
Ho: Word frequencies and capital-letter run-lengths have no relationship with spam
H1: Word frequencies and capital-letter run-lengths have a relationship with spam
Hypothesis 2:
Ho: There is no relationship between the time of arrival and the amount of spam
H1: There is a relationship between the time of arrival and the amount of spam
2. Approach
2.1. Data
Two datasets are used in this study. The first dataset is used to build the static filter and comes from Hewlett-Packard Labs, created by Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. The spam data contains 4601 observations and 58 variables. The first 57 variables are continuous and indicate the frequency of certain words and characters in the e-mail; the last variable indicates whether a mail is spam (1) or ham (0). A detailed description of the variables is summarized in Tables A1 and A2 (Appendix A). There are no missing values in the dataset. In total, 1813 (39.4%) e-mails are labeled as spam and 2788 (60.6%) as ham.
The second dataset is used to build the time series filters. It consists of two e-mail time series databases, i.e. spam and ham e-mails [2], [3]. The spam database contains 364 observations and 4 variables, collected in 2004-2005. The ham database contains 506 observations and 4 variables, collected in 2000-2001. The first 3 variables indicate the year, the month, and the day, and the last variable indicates the amount of e-mails received. There are no missing values in either dataset.
2.2. Analysis
2.2.1. Static filter design
To build the spam filter model, I perform logistic regression analysis using R. The stages of the data analysis are as follows:
1. Divide the dataset into a training set (2/3) and a test set (1/3). The training set is used to build the logistic regression models and the test set to evaluate the performance of the fitted models.
2. Build a main effects model from a smaller set of predictor variables; uninformative predictors are excluded. Variables whose factor plots discriminate spam from ham are selected as candidate predictors for the main model. Principal Component Analysis (PCA) is also used to select variables that have high loadings in the first and second principal components.
3. Build an interaction model by including relevant interactions in the main effect model. This is done by
performing a logistic regression with only two predictor variables, including the interaction between
them, to predict the outcome. The interaction plot displays the predicted probability of the response
variable with the interaction between the two predictors displayed as factors (categorical variables). The
predictor variables are categorized into two levels:

x*_ij = 1 if x_ij >= x̄_j, and x*_ij = 0 if x_ij < x̄_j, for i = 1, 2, ..., n and j = 1, 2, ..., p,

where x̄_j is the mean of predictor j, n is the number of observations, and p is the number of predictor variables. The interaction is selected when the estimated parameter of the two-way interaction is significant in the model and the interaction plot shows a crossover interaction; in other words, the two lines have opposite slopes or intersect.
4. Perform stepwise selection procedure to select important variables in the main effects model as well as
in the interaction model. The reduced model is then examined by using partial likelihood test to check if
some of the predictors can be eliminated from the model.
5. Fit the main effects and interaction models using a log scale for the predictors with offset 0.01, i.e. log(x + 0.01). Repeat step 4 to reduce the models with the log transformation in the regression equation.
6. Fit a Principal Component (PC) regression that accounts for 98% of the variability, to the data before and after log-transformation.
7. Compare main effects, interaction, and PC regression models before and after transformation by using
AIC and BIC criteria to select the best model.
8. Evaluate the model by computing the score table with the decision threshold 0.5 for each model and plot
the ROC curve to show the performance of all models.
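Steps 1 and 5 of the procedure above can be illustrated in a few lines. The report performs these steps in R; the following is a minimal Python sketch, using integer row indices as a stand-in for the 4601 e-mails and a hypothetical seed for the random split:

```python
import math
import random

def split_train_test(rows, train_frac=2 / 3, seed=2013):
    """Random 2/3 train / 1/3 test split, as in step 1 (seed is illustrative)."""
    rows = rows[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def log_offset(x, offset=0.01):
    """log(x + 0.01) transform from step 5; the offset keeps zero frequencies finite."""
    return math.log(x + offset)

rows = list(range(4601))              # stand-in for the 4601 e-mails
train, test = split_train_test(rows)
print(len(train), len(test))          # 3067 1534
print(round(log_offset(0.0), 4))      # log(0.01) = -4.6052
```

The offset matters because most word-frequency predictors are exactly zero for many messages, and log(0) is undefined.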
Figure 2 displays the biplot for V1-V57. I can observe that ham and spam e-mails have the greatest variance in the direction of the first and second principal components, respectively. Figure 3 displays the biplot for V1-V57 after log-transformation; it shows that spam e-mails have about equal variance in the first and second components. The two components for the log-transformed variables explain more variance (27.9%) than the two components for the untransformed variables (17.3%). The first three predictors with high loadings in the first and second components are taken into account in the main effects model. For PCA before transformation, V32, V34, and V40 have high loadings in the first component, and V21, V56, and V23 have high loadings in the second component. For PCA after transformation, V53, V56, and V57 have high loadings in the first component, and V31, V32, and V34 have high loadings in the second component.
Based on the factor plots, only 5 of the 57 variables discriminate spam from ham. They are V3, V5, V12, V19, and V21, as shown in Figure 4; the factor plots for the remaining predictors are shown in Figures B1 and B2 in Appendix B. The mean frequency of these variables is clearly higher for spam than for ham e-mails and is well above zero.
The correlation matrix of the predictors selected by PCA and the factor plots is plotted in Figure 5. The predictors show relatively weak correlations with each other; only a few exceed 0.5. It can be observed that V32 and V34 are perfectly correlated, so there is no need to include both in the model: they convey the same information, and dropping one avoids multicollinearity. Therefore, only V34 is retained in the model building.
Figure 5. Correlation matrix
In total, I have 12 potential predictors for my main effects model, which can be expressed as the following logistic regression model (Model 1):

log[P(spam = 1) / P(spam = 0)] = β0 + β3 V3 + β5 V5 + β12 V12 + β19 V19 + β21 V21 + β23 V23 + β31 V31 + β34 V34 + β40 V40 + β53 V53 + β56 V56 + β57 V57
It appears that the stepwise selection procedure produces the same main effects as Model 1, meaning no predictors are eliminated from the model. However, I find that V34 is insignificant in the model, so I remove this variable. The partial likelihood test shows that V34 can be removed: the null hypothesis cannot be rejected with p-value = 0.06 (Table A3, Appendix A), and thus the reduced model is preferable (Model 2).
Figure 6. Interaction plot
Based on the interaction plots, there are 4 interactions with crossover, as shown in Figure 6; the rest of the interaction plots can be found in Figure B3 in Appendix B. Including these interactions in the main effects model, my interaction model can be written as follows (Model 3):

log[P(spam = 1) / P(spam = 0)] = β0 + β3 V3 + β5 V5 + β12 V12 + β19 V19 + β21 V21 + β23 V23 + β31 V31 + β34 V34 + β40 V40 + β53 V53 + β56 V56 + β57 V57 + β3,40 V3 V40 + β23,34 V23 V34 + β23,40 V23 V40 + β34,40 V34 V40
Model 3 can be reduced by removing 3 interaction terms in the model with stepwise selection procedure. The
partial likelihood test shows that the null hypothesis cannot be rejected with p-value = 0.70 (Table A4,
Appendix A) and thus the reduced interaction model is preferable (Model 4). Table 4 summarizes the
estimated parameters and the corresponding standard errors for the four models.
Table 4. Parameter estimates and standard errors for logistic regression models with untransformed data

             Model 1:             Model 2:             Model 3:             Model 4:
             main effects         reduced main         interaction          reduced interaction
Parameter    Estimate (SE)        Estimate (SE)        Estimate (SE)        Estimate (SE)
Intercept    -2.362 (0.100)       -2.372 (0.100)       -2.357 (0.100)       -2.359 (0.100)
V3           -                    -                    0.189 (0.096)        0.188 (0.095)
V5           0.818 (0.090)        0.811 (0.090)        0.813 (0.090)        0.813 (0.090)
V12          -0.316 (0.074)       -0.310 (0.074)       -0.324 (0.075)       -0.323 (0.075)
V19          0.212 (0.028)        0.215 (0.028)        0.215 (0.028)        0.215 (0.028)
V21          0.402 (0.043)        0.398 (0.043)        0.399 (0.043)        0.398 (0.043)
V23          4.581 (0.648)        4.594 (0.649)        4.445 (0.648)        4.569 (0.649)
V31          -17.470 (3.963)      -17.690 (3.928)      -16.790 (4.029)      -17.650 (3.751)
V34          -1.589 (1.110)       -                    -1.593 (1.138)       -1.516 (1.135)
V40          0.396 (0.150)        0.420 (0.148)        1.071 (0.534)        1.079 (0.536)
V53          7.413 (0.727)        7.444 (0.729)        7.372 (0.723)        7.396 (0.724)
V56          0.016 (0.002)        0.016 (0.002)        0.016 (0.002)        0.016 (0.002)
V57          0.0001 (0.0001)      0.0001 (0.0001)      0.0002 (0.0002)      0.0002 (0.0001)
V3:V40       -                    -                    8.310 (3.094)        8.436 (2.824)
V23:V34      -                    -                    75.090 (27040.0)     -
V23:V40      -                    -                    38.230 (47.510)      -
V34:V40      -                    -                    -50.240 (99.810)     -
AIC          2370.1               2371.7               2362.7               2358.1
BIC          2448.5               2444.1               2465.2               2442.5

*insignificant at 5% level
Using the same approach as for the untransformed variables, the main effects model with log-transformed variables can be reduced by dropping one predictor, V3. The partial likelihood test shows that the null hypothesis cannot be rejected with p-value = 0.29 (Table A5, Appendix A), so the reduced main effects model performs better. Similarly, the interaction model with transformed variables can be reduced by dropping 3 interaction terms; the partial likelihood test shows an insignificant p-value (Table A6, Appendix A). Table 5 summarizes the estimated parameters and standard errors for the four models with transformed variables in the logistic regression equation.
Table 5. Parameter estimates and standard errors for logistic regression models with transformed data

             Model 5:             Model 6:             Model 7:             Model 8:
             main effects         reduced main         interaction          reduced interaction
Parameter    Estimate (SE)        Estimate (SE)        Estimate (SE)        Estimate (SE)
Intercept    -3.091 (1.362)       -3.248 (1.352)       45.3* (10180.0)      -2.050 (1.505)
V3           0.032 (0.033)        -                    -                    -
V5           0.265 (0.029)        0.269 (0.029)        0.264 (0.029)        0.264 (0.029)
V12          -0.181 (0.028)       -0.178 (0.028)       -0.181 (0.028)       -0.181 (0.028)
V19          0.162 (0.028)        0.165 (0.028)        0.163 (0.028)        0.163 (0.028)
V21          0.203 (0.027)        0.206 (0.027)        0.205 (0.027)        0.205 (0.027)
V23          0.429 (0.064)        0.433 (0.063)        16.9 (2217.0)        0.430 (0.064)
V31          -1.074 (0.178)       -1.083 (0.178)       -1.040 (0.186)       -1.057 (0.184)
V34          -0.713 (0.222)       -0.717 (0.222)       -6.052 (2210.0)      -0.710 (0.223)
V40          0.224 (0.093)        0.228 (0.093)        9.764 (176.8)        0.436 (0.158)
V53          0.656 (0.051)        0.656 (0.051)        0.652 (0.051)        0.653 (0.051)
V56          1.082 (0.084)        1.075 (0.084)        1.083 (0.084)        1.084 (0.084)
V57          -0.491 (0.080)       -0.473 (0.078)       -0.492 (0.080)       -0.494 (0.080)
V3:V40       -                    -                    0.087 (0.049)        0.087 (0.049)
V23:V34      -                    -                    3.633 (481.5)        -
V23:V40      -                    -                    -0.051 (0.083)       -
V34:V40      -                    -                    -1.362 (38.39)       -
AIC          2112.6               2111.7               2116.2               2111.3
BIC          2190.9               2184.0               2218.7               2195.7

*insignificant at 5% level
A PC regression is also fitted to the data before and after transformation, using the 12 predictors selected for the main effects model. Ten components explain 98% of the variation. The utility test shows that the null hypothesis is rejected, meaning the 10 components cannot be removed from the model (Tables A7 and A8, Appendix A). The results are shown in Table 6.
Table 6. Parameter estimates and standard errors for the PC regression models

               Model 9:                 Model 10:
               untransformed data       transformed data
               Estimate (Std. Error)    Estimate (Std. Error)
(Intercept)    0.159 (0.089)            -0.829 (0.082)
Comp.1         1.403 (0.150)            -1.169 (0.045)
Comp.2         -2.977 (0.174)           -1.017 (0.155)
Comp.3         1.080 (0.141)            0.406 (0.055)
Comp.4         0.637 (0.192)            -0.694 (0.066)
Comp.5         -0.397 (0.072)           0.708 (0.068)
Comp.6         0.210 (0.114)            -0.022 (0.059)
Comp.7         -0.189 (0.061)           -0.327 (0.065)
Comp.8         0.004 (0.102)            -0.204 (0.079)
Comp.9         0.182 (0.177)            0.507 (0.102)
Comp.10        -2.279 (0.274)           0.552 (0.124)
AIC            2433.7                   2206.7
BIC            2500.0                   2273.0

*insignificant at 5% level
2.2.2. Time series filter design
To build time series filter design, I perform time series analysis using R software. The stages of data analysis
are as follows:
1. Plot periodogram to discover the periodic components of a time series.
2. Model the trend and seasonality components of spam and ham using a linear regression model with the amount of e-mails received per day as the response variable. The model building uses all the data except the last 7 days; the last week of data is used for forecasting.
3. Obtain the residuals from the fitted regression model and check whether they are correlated. If the residuals appear non-stationary, the first differences of the residuals are considered.
4. Examine the ACF and PACF plots to estimate the order p for the AR model and the order q for the MA model.
5. Select several candidate models using the AIC and BIC criteria, and use diagnostic plots to check model adequacy. The MSE is also used to evaluate the forecasting performance of the models.
Figure 7 displays the amount of spam and ham e-mails received per day. The amount of spam e-mails mostly ranges from 10 to 40 mails per day. The series drifts slowly up and down around an average of 27 spam e-mails per day. Although the amount of spam e-mails has seen a few recent drops, it trends slightly upward, with some significant peaks at 40 to 60 spam e-mails. Meanwhile, the majority of ham e-mails fluctuate between 0 and 10 e-mails per day, with an average of about 4 daily ham e-mails. The series tends to wander up and down around the average for a while and then drop to zero.
Figure 7 also displays the periodograms for the spam and ham data. The spam periodogram shows no obvious seasonal component; it reaches its highest peak at the lowest frequency, i.e. at a period of about 375 days. The ham periodogram shows a weekly seasonal component with the highest peak at frequency 0.15, i.e. at a period of 6.7 days.
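The periodogram in step 1 measures how much variance sits at each frequency; a peak at frequency f points to a cycle of period 1/f days. A naive pure-Python sketch (the report computes this in R; the synthetic series below is a made-up stand-in for the ham counts):

```python
import cmath
import math

def periodogram(x):
    """Naive DFT periodogram: (frequency, power) at k/n for k = 1..n//2."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]  # remove the mean so power at f=0 is zero
    powers = []
    for k in range(1, n // 2 + 1):
        s = sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(xc))
        powers.append((k / n, abs(s) ** 2 / n))
    return powers

# Synthetic "ham-like" series with a weekly cycle (period 7 days)
series = [4 + 3 * math.sin(2 * math.pi * t / 7) for t in range(140)]
freq, _ = max(periodogram(series), key=lambda fp: fp[1])
print(round(1 / freq, 1))  # dominant period in days: 7.0
```

In practice one would use a fast Fourier transform; the O(n^2) loop above is only for transparency.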
The trend and daily seasonality of the amount of spam and ham e-mails received per day are modeled by the following regression model:

y_i = β0 + β1 t_i + β2 Sun_i + β3 Mon_i + β4 Tue_i + β5 Wed_i + β6 Thu_i + β7 Fri_i + ε_i

where y_i is the amount of spam/ham e-mails on day i, t_i is time (the day index), and Sun, Mon, Tue, Wed, Thu, and Fri are indicators for daily seasonality, with Saturday as the base case.
I find that time is a significant component of the amount of spam e-mails received per day: a t-test shows that β1 is not equal to zero, with p-value < 0.001 (Table 7). This means time is an important factor, with a very slowly increasing trend from day to day. Daily seasonality, by contrast, is not an important component of the spam regression model: the partial F-test shows that it can be removed, with F-stat = 0.813 and p-value = 0.56. Thus, the spam time series can be modeled using the time trend only, which agrees with the periodogram showing no obvious seasonality for the spam data. For the ham time series, seasonality is an important component and cannot be dropped from the model. The estimated regression parameters and standard errors for the spam and ham regression models are given in Table 7. Both models are significant overall at the 5% level. Since regression models of time series frequently have correlated residuals, the error term ε is then modeled by a time series model.
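The trend-only spam model reduces to a simple least-squares line. A minimal sketch of that fit (the report uses R's lm(); the series below is a noiseless made-up example with the same slope magnitude as the fitted spam trend):

```python
def fit_trend(y):
    """Closed-form least-squares fit of y_i = b0 + b1 * t_i with t = 0, 1, 2, ..."""
    n = len(y)
    t = list(range(n))
    t_bar, y_bar = sum(t) / n, sum(y) / n
    b1 = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) / \
         sum((ti - t_bar) ** 2 for ti in t)
    b0 = y_bar - b1 * t_bar
    return b0, b1

# A series rising by 0.019 e-mails per day around a level of 24
y = [24 + 0.019 * t for t in range(364)]
b0, b1 = fit_trend(y)
print(round(b0, 3), round(b1, 3))  # 24.0 0.019
```

With a real series the residuals y_i - (b0 + b1 t_i) would then be passed to the ARIMA modeling stage.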
Table 7. The estimated parameters and standard errors for the spam and ham linear regression models

                  Spam                        Ham
Parameter         Estimate (Std. Error)       Estimate (Std. Error)
β0 (intercept)    23.997 (0.881)              0.570 (0.508)
β1 (time)         0.019 (0.004)               -0.001 (0.001)
β2 (Sun)          -                           0.240 (0.603)
β3 (Mon)          -                           4.495 (0.603)
β4 (Tue)          -                           5.510 (0.603)
β5 (Wed)          -                           5.286 (0.603)
β6 (Thu)          -                           5.136 (0.601)
β7 (Fri)          -                           4.707 (0.601)
R2                5.3%                        28.3%
Adjusted R2       5.1%                        27.3%
Overall model     F-statistic: 19.99 on 1     F-statistic: 27.73 on 7
                  and 335 DF,                 and 491 DF,
                  p-value: 1.05e-05           p-value: < 2.2e-16
The histogram of the spam residuals is approximately normal, while the histogram of the ham residuals is skewed to the right. The run sequence plots of the residuals from the linear regression fits show a non-constant mean that appears to change over time; thus the regression residuals for both spam and ham are considered non-stationary, and it may be necessary to take the first difference of the residuals (Figure 8).
However, I can still consider several potential ARIMA models by looking at the ACF and PACF plots of the residuals before taking any differences. I can observe that the sample ACF decays relatively slowly and the sample PACF has no significant peaks after lag 3; thus AR(3) is one of the candidate models. I also consider ARMA models that include both AR and MA terms. Although the ARMA lags cannot be selected solely from the ACF and PACF, it seems that no more than 3 AR and 3 MA terms are needed for the spam residuals and no more than 2 AR and 3 MA terms for the ham residuals. To identify the best lags, I fit several models with all combinations of p and q, and I also use an automated procedure to find the best ARIMA model. For the spam residuals, ARIMA(1,0,1) is the best fitted model based on the AIC/BIC criteria and the automated procedure, with ARIMA(2,0,1) the second best by AIC/BIC. For the ham residuals, ARIMA(2,0,1), ARIMA(1,0,1), and ARIMA(1,0,2) are the best fitted models based on AIC, BIC, and the automated procedure, respectively. The AIC and BIC values for all potential models are summarized in Table A9 (Appendix A).
A).
After taking the first difference of the residuals, the run sequence plots indicate constant location and variance, meaning the resulting series appear stationary for both the spam and ham residuals (Figure 9). The ACF plot shows that the autocorrelation at lag 1 exceeds the significance bound, while all other autocorrelations up to lag 25 do not. Since the ACF is zero after lag 1 and the PACF decays more slowly, I consider ARIMA(0,1,1) as another candidate model. I also take additional MA terms into account for the differenced series, since there is a significant spike at lag 3 in the PACF of the spam residuals; the same number of MA terms is considered for the ham residuals, as it is not necessary to account for high-order lags. After fitting all combinations of p and q on the first differences, I find that ARIMA(0,1,1) and ARIMA(1,1,1) are the best fitted models for the spam and ham residuals, respectively, based on the AIC and BIC criteria (Table A9, Appendix A).
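The differencing and ACF diagnostics used above are easy to state in code. A pure-Python sketch (the report uses R's diff() and acf(); the random walk below is a made-up stand-in for non-stationary regression residuals):

```python
import random

def first_difference(x):
    """First differences, used to make a non-stationary series stationary."""
    return [b - a for a, b in zip(x, x[1:])]

def sample_acf(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k)) / var
        for k in range(1, max_lag + 1)
    ]

# A random walk has a slowly decaying ACF; its first difference does not
rng = random.Random(42)
walk = [0.0]
for _ in range(499):
    walk.append(walk[-1] + rng.gauss(0, 1))
diff = first_difference(walk)
print(sample_acf(walk, 1)[0] > 0.8)       # True: strong lag-1 autocorrelation
print(abs(sample_acf(diff, 1)[0]) < 0.3)  # True: differencing removed it
```

This mirrors the reasoning in the text: a slowly decaying sample ACF signals that differencing (the middle "I" in ARIMA) is worth trying.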
Figure 9. The first difference of residuals plot
2.2.3. Overall filter design
Bayes rule states that

P(A | B) = P(B | A) P(A) / P(B)

It expresses a conditional probability: the probability of an event A occurring given some evidence B. This equation can be used to derive the mathematical model that integrates the static and time series filters. The purpose is to calculate the probability that an e-mail is spam given two pieces of evidence, the static filter's verdict and the forecast amount of spam at the arrival time:

P(E = 1 | S = s, T = t) = P(S = s, T = t | E = 1) P(E = 1) / Σ_{e=0,1} P(S = s, T = t | E = e) P(E = e)

where:
E indicates whether the e-mail is truly spam (1) or ham (0)
S is the static filter's indication that the message is spam (1) or non-spam (0)
T is the time series filter's indication that the message is spam (1) or non-spam (0)
Assuming S and T are conditionally independent given the class of the e-mail, P(S = 1, T = 1 | E = 1) can be decomposed into two terms:

P(S = 1, T = 1 | E = 1) = P(S = 1 | E = 1) P(T = 1 | E = 1)
The first term is the probability that the static filter gives positive evidence (spam) given that my e-mail is spam. This is calculated as the true positive rate, i.e. TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives. The second term is the probability that the time series filter gives positive evidence (spam) given that my e-mail is spam. This is calculated as the number of spam e-mails received at time t divided by the total number of e-mails (spam and ham) received at time t; in other words, P(T = 1 | E = 1) = #spam_t / (#spam_t + #ham_t). Meanwhile, P(E = 1) is the prior probability that a message is spam. Using the data from Hewlett-Packard Labs, I can calculate that the probability of a spam message is 0.4.
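The Bayes-rule combination above can be sketched directly. The prior of 0.4 comes from the Hewlett-Packard data and the 0.78 true positive rate matches the static filter's score table; the time series filter's rates below are made-up values for illustration:

```python
def p_spam_given_evidence(s, t, p_s_given_spam, p_t_given_spam,
                          p_s_given_ham, p_t_given_ham, prior_spam=0.4):
    """Bayes-rule combination of the static (S) and time series (T) filters,
    assuming the two filters are conditionally independent given the class."""
    def lik(p_s, p_t):
        # P(S=s, T=t | class) = P(S=s | class) * P(T=t | class)
        return (p_s if s == 1 else 1 - p_s) * (p_t if t == 1 else 1 - p_t)
    num = lik(p_s_given_spam, p_t_given_spam) * prior_spam
    den = num + lik(p_s_given_ham, p_t_given_ham) * (1 - prior_spam)
    return num / den

# Static filter: TPR 0.78, FPR 0.079; hypothetical time series filter rates 0.8 / 0.3
p = p_spam_given_evidence(1, 1, 0.78, 0.8, 0.079, 0.3)
print(round(p, 3))  # 0.946: two agreeing positive verdicts push the prior 0.4 up sharply
```

When both filters flag a message, the posterior rises well above either filter's individual accuracy, which is the point of the integration.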
3. Evidence
3.1. Static filter design
Reducing the number of predictors in a logistic regression model generally improves the performance of the fitted model, as reflected in smaller AIC and BIC values than the full model. This is not the case for the main effects model with untransformed data, however, since the AIC value is smaller for Model 1. For the untransformed data, the reduced interaction model fits best, with the smallest AIC (2358.1) and BIC (2442.5) values. For the transformed data, the reduced interaction model (Model 8) fits better based on AIC, while the reduced main effects model fits better based on BIC. In general, the logistic regression models with transformed variables perform better, as their AIC and BIC values are smaller than those of the models with untransformed variables (Table 8).
Table 8. AIC and BIC values for the fitted logistic regression models

                                        AIC       BIC
Untransformed data
  Model 1: main effects                 2370.1    2448.5
  Model 2: reduced main effects         2371.7    2444.1
  Model 3: interaction                  2362.7    2465.2
  Model 4: reduced interaction          2358.1    2442.5
  Model 9: PC regression                2433.7    2500.0
Transformed data
  Model 5: main effects                 2112.6    2190.9
  Model 6: reduced main effects         2111.7    2184.0
  Model 7: interaction                  2116.2    2218.7
  Model 8: reduced interaction          2111.3    2195.7
  Model 10: PC regression               2206.7    2273.0
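The AIC and BIC values compared here follow directly from each model's deviance. A small sketch of the arithmetic, using Model 6's residual deviance (2087.7), its 12 estimated coefficients, and the 3067 training e-mails, all taken from elsewhere in this report:

```python
import math

def aic_bic(deviance, n_params, n_obs):
    """AIC and BIC for a maximum-likelihood fit, where deviance = -2 * log-likelihood."""
    aic = deviance + 2 * n_params
    bic = deviance + n_params * math.log(n_obs)
    return aic, bic

# Model 6: residual deviance 2087.7, 12 parameters, 3067 training observations
aic, bic = aic_bic(2087.7, 12, 3067)
print(round(aic, 1), round(bic, 1))  # 2111.7 2184.0, matching Model 6's row above
```

Because BIC's penalty ln(3067) ≈ 8.03 per parameter is far harsher than AIC's 2, BIC favors the leaner main effects model while AIC can prefer a model with extra interaction terms.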
This evidence is also supported by the number of total errors made by the decision functions of the fitted models. The models with untransformed variables make about 3 percentage points more errors than the models with transformed variables. It appears that the transformation creates slightly more false positives but fewer false negatives (Table 9). In addition, the comparison of ROC curves indicates that applying the transformation increases the accuracy in distinguishing between spam and non-spam e-mails: the transformed models discriminate better, since the red curve lies above the blue curve and closer to the upper left corner (Figure 10).
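An ROC curve is traced by sweeping the decision threshold and recording the false positive and true positive rates at each setting. A pure-Python sketch (the report draws these curves in R; the scores and labels below are made-up values):

```python
def roc_points(probs, labels):
    """(FPR, TPR) pairs swept over all decision thresholds, highest first."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    for thr in sorted(set(probs), reverse=True):
        tp = sum(p >= thr and y == 1 for p, y in zip(probs, labels))
        fp = sum(p >= thr and y == 0 for p, y in zip(probs, labels))
        pts.append((fp / neg, tp / pos))
    return pts

probs = [0.9, 0.8, 0.7, 0.4, 0.3]  # hypothetical predicted spam probabilities
labels = [1, 1, 0, 1, 0]
print(roc_points(probs, labels))
```

A curve hugging the upper left corner means high true positive rates are reachable at low false positive rates, which is exactly the comparison made between the transformed and untransformed models.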
Table 9. Score tables (decision threshold = 0.5) for the fitted models on the test set

            True      True      False     False     Total     Total    TP      TN      FP      FN
            positive  negative  positive  negative  errors    error %  rate    rate    rate    rate
Untransformed data
Model 1     424       858       72        180       252       16.4%    70.2%   92.3%   7.7%    29.8%
Model 2     423       859       71        181       252       16.4%    70.0%   92.4%   7.6%    30.0%
Model 3     423       858       72        181       253       16.5%    70.0%   92.3%   7.7%    30.0%
Model 4     423       858       72        181       253       16.5%    70.0%   92.3%   7.7%    30.0%
Model 9     419       859       71        185       256       16.7%    69.4%   92.4%   7.6%    30.6%
Transformed data
Model 5     470       856       77        131       208       13.6%    78.2%   91.7%   8.3%    21.8%
Model 6     469       859       74        132       206       13.4%    78.0%   92.1%   7.9%    22.0%
Model 7     469       856       77        132       209       13.6%    78.0%   91.7%   8.3%    22.0%
Model 8     470       856       77        131       208       13.6%    78.2%   91.7%   8.3%    21.8%
Model 10    467       837       96        134       230       15.0%    77.7%   89.7%   10.3%   22.3%
Figure 11. ROC curves for models with untransformed and transformed data
There are two best candidate models with transformed variables based on AIC and BIC: the reduced main effects model and the reduced interaction model. The ROC curves for the two models overlap, meaning they have similar accuracy (Figure 11). Looking at the total errors, the main effects model makes fewer errors than the interaction model. Moreover, given that a false positive is more serious than a false negative in spam filtering, the main effects model is superior to the interaction model because it makes fewer false positives. In addition, the partial likelihood test indicates that the interaction model (Model 8) can be simplified by removing the interaction term (Model 6): the p-value is 0.111, greater than 0.05, so I cannot reject the null hypothesis, and Model 6 (the reduced main effects model) performs better.
Table 10. Partial likelihood test comparing Model 6 and Model 8

           Resid. DF    Resid. Deviance    DF    Deviance    Pr(>Chi)
Model 6    3055         2087.7
Model 8    3053         2083.3             2     4.4         0.111
For efficiency, the main effects model is also preferable because it estimates fewer parameters. Thus, the reduced main effects model with 11 log-transformed predictor variables is selected as the best model for the spam filter design and can be written as follows:

log[P(spam = 1) / P(spam = 0)] = -3.248 + 0.269 log(V5 + 0.01) - 0.178 log(V12 + 0.01) + 0.165 log(V19 + 0.01) + 0.206 log(V21 + 0.01) + 0.433 log(V23 + 0.01) - 1.083 log(V31 + 0.01) - 0.717 log(V34 + 0.01) + 0.228 log(V40 + 0.01) + 0.656 log(V53 + 0.01) + 1.075 log(V56 + 0.01) - 0.473 log(V57 + 0.01)
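Applying the selected model to a new message means evaluating this linear predictor and passing it through the logistic function. A minimal sketch with the fitted coefficients (the dictionary of raw V-values for a message is a hypothetical representation; the report scores messages in R):

```python
import math

# Coefficients of the selected reduced main effects model (Model 6);
# each predictor enters the model as log(V + 0.01)
COEF = {"V5": 0.269, "V12": -0.178, "V19": 0.165, "V21": 0.206,
        "V23": 0.433, "V31": -1.083, "V34": -0.717, "V40": 0.228,
        "V53": 0.656, "V56": 1.075, "V57": -0.473}
INTERCEPT = -3.248

def spam_probability(email):
    """P(spam) from the fitted logistic model; `email` maps V-names to raw values."""
    eta = INTERCEPT + sum(b * math.log(email[v] + 0.01) for v, b in COEF.items())
    return 1.0 / (1.0 + math.exp(-eta))

# A message with every predictor at raw value 0 scores well below the 0.5 threshold
neutral = {v: 0.0 for v in COEF}
print(spam_probability(neutral) < 0.5)  # True
```

Raising a positively weighted predictor such as V56 (total capital-letter run-length) pushes the probability toward spam, matching the direction of the fitted coefficient.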
Table 11. AIC, BIC, and MSE for the candidate ARIMA models

                      AIC       BIC       MSE
Spam residuals
  ARIMA(1,0,1)        2494.0    2509.5    704.0
  ARIMA(2,0,1)        2494.9    2514.3    703.5
  ARIMA(0,1,1)        2489.4    2497.1    713.2
Ham residuals
  ARIMA(1,0,1)        2610.4    2627.2    10.3
  ARIMA(1,0,2)        2606.6    2627.6    8.2
  ARIMA(2,0,1)        2606.2    2627.3    8.0
  ARIMA(1,1,1)        2609.2    2621.8    2.0
The 3 best fitted models selected for the spam residuals and the 4 best fitted models selected for the ham
residuals are summarized in Table 11, including the MSE score used to evaluate forecast quality. The ARIMA
model obtained from the automated selection procedure produces the lowest MSE, meaning that
ARIMA(2,0,1) provides the best forecast for the spam residuals compared to the other models. For the ham
residuals, the ARIMA model predicts negative values for the amount of e-mails received, while this
variable can only take non-negative values. This is not a desirable feature of my current predictive
model, so I convert negative predictions to zero to make more sense of the estimated amount of e-mails
received per day. I find that ARIMA(1,1,1) provides the best forecast for the ham residuals, with MSE = 2.0,
an AIC of 2609.2, and the lowest BIC value (2621.8).
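The zero-truncation of negative forecasts and the MSE comparison can be sketched as follows (the forecast and actual values here are made-up numbers for illustration, not the report's data):

```python
def clip_nonnegative(forecasts):
    """E-mail counts cannot be negative, so floor each forecast at zero."""
    return [max(0.0, f) for f in forecasts]

def mse(actual, predicted):
    """Mean squared error of a forecast against held-out observations."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Illustrative values only
raw = [3.2, -1.4, 0.8, -0.2]
clipped = clip_nonnegative(raw)     # negative forecasts become 0.0
error = mse([3, 1, 1, 0], clipped)  # compare against observed counts
```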
Based on the diagnostic plots displayed in Figures 12 and 13, the standardized residuals do not violate the
assumption of constant location and scale; most of them lie in the range (−2, 2). The ACF plots show no
significant spikes before lag 25, so the residuals are not autocorrelated. The Ljung-Box test indicates that
there is no non-zero autocorrelation among the first 10 lags, except for ARIMA(1,0,1) on the ham residuals.
This shows that all the selected models are adequate except ARIMA(1,0,1).
The predicted amount of e-mails received in the last 7 days, based on ARIMA(2,0,1) for spam and
ARIMA(1,1,1) for ham, is shown in Figure 14. ARIMA(2,0,1) gives an almost constant prediction over the
last 7 days, and ARIMA(1,1,1) overestimates the amount of ham e-mails, especially from day 3 to day 6. The
residual plots of these two ARIMA models show no deviation from the normality assumption (Figure 15).
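Iterated forecasts from an ARMA(2,1) model, such as the one fitted to the spam residuals, follow a simple recursion: each step combines the two previous values and the last innovation, with future innovations set to their expected value of zero. A sketch with made-up coefficients (not the fitted ones from the report):

```python
def arma21_forecast(history, residuals, phi1, phi2, theta1, steps=7):
    """h-step-ahead forecasts for the ARMA(2,1) recursion
    x_t = phi1*x_{t-1} + phi2*x_{t-2} + theta1*e_{t-1} + e_t,
    with future shocks e set to zero."""
    x = list(history)
    e = list(residuals)
    out = []
    for _ in range(steps):
        xhat = phi1 * x[-1] + phi2 * x[-2] + theta1 * e[-1]
        out.append(xhat)
        x.append(xhat)
        e.append(0.0)  # E[e_{t+h}] = 0 for future periods
    return out

# Illustrative coefficients only
fc = arma21_forecast([0.0, 2.0], [0.0, 0.0], 0.5, 0.0, 0.0, steps=3)
# → [1.0, 0.5, 0.25]: forecasts decay toward the series mean of zero
```

This decay toward the mean explains the nearly constant prediction that ARIMA(2,0,1) produces over the last 7 days in Figure 14.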
Figure 12. Diagnostic plots for spam: (a) ARIMA(1,0,1); (b) ARIMA(2,0,1); (c) ARIMA(0,1,1)
Figure 13. Diagnostic plots for ham: (a) ARIMA(1,0,1); (b) ARIMA(1,0,2); (c) ARIMA(2,0,1); (d) ARIMA(1,1,1)
Figure 14. Forecasting the amount of e-mails received in the last 7 days
4. Recommendation
For the static filter design, it is important to transform the variables into log-scale, as this increases the
accuracy of the logistic regression model in distinguishing between spam and ham e-mails. There are 3
important factors or variables that need to be considered in filtering out spam: the frequency of certain
words/characters and the longest and total run-lengths of capital letters. These factors are highly significant
in the model at the 5% level. The logistic regression model with these factors in log-scale as main effects
produces the highest accuracy, with the smallest total error (13.4%) and false positive rate (7.9%). It also fits
better based on the BIC criterion (2184.0) and performs better based on the ROC curve.
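The log-scale transformation of the predictors can be sketched as below. Since many frequency variables are exactly zero, a small offset is needed before taking the log; the offset value here is an assumption for illustration, not necessarily the one used in the report:

```python
import math

def log_transform(values, offset=0.1):
    """Log-scale a predictor column, shifting by a small offset so that
    zero frequencies remain defined. The offset choice is illustrative."""
    return [math.log(v + offset) for v in values]

# e.g. word-frequency values for one predictor column
transformed = log_transform([0.0, 0.5, 2.3])
```

Compressing the heavy right tail of the frequency variables in this way is what lets the logistic model separate the classes more accurately.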
For the time series filter design, I find that the daily amount of spam e-mails can be modeled with a trend
component, and there is no need to account for a seasonal component in the linear regression model. Time of
arrival is highly significant in the regression model at the 5% level, with an increasing trend. The spam
residuals can be modeled by an ARMA model with 2 autoregressive (AR) terms and 1 moving average (MA)
term, and this model gives the best forecast with MSE 693.2. Meanwhile, the ham residuals can be modeled
by ARIMA(1,1,1), which gives the best forecast with MSE 2.0; this model also has the lowest BIC value.
Given the evidence from the static and time series filter designs, I can calculate the probability that a received
e-mail is spam by integrating the two filters using Bayes rule. To get this probability, three terms need to be
calculated: P(S = 1), the prior probability of spam e-mails; P(F = 1 | S = 1), the true positive rate of the static
filter; and P(T = 1 | S = 1), the probability of receiving spam e-mails at time t.
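Combining the two filters with Bayes rule can be sketched as follows, under the simplifying assumption that the static-filter output F and the time-of-arrival evidence T are conditionally independent given the class; all numeric inputs below are illustrative, not estimates from the report:

```python
def p_spam_given_evidence(p_spam, p_f_given_spam, p_t_given_spam,
                          p_f_given_ham, p_t_given_ham):
    """P(S=1 | F=1, T) via Bayes rule, assuming F and T are
    conditionally independent given the class label."""
    num = p_f_given_spam * p_t_given_spam * p_spam
    den = num + p_f_given_ham * p_t_given_ham * (1.0 - p_spam)
    return num / den

# Illustrative probabilities only
post = p_spam_given_evidence(0.4, 0.9, 0.7, 0.08, 0.3)
```

When both filters carry no information (equal likelihoods under spam and ham), the posterior collapses back to the prior, which is a quick sanity check on the formula.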
Appendix A
Variable name   Description
V1 - V48        frequency (%) of a particular word in the e-mail
V49 - V54       frequency (%) of a particular character in the e-mail
V55             average length of uninterrupted runs of capital letters
V56             length of the longest uninterrupted run of capital letters
V57             total number of capital letters in the e-mail

Words and characters corresponding to V1 - V54:
V1 - make       V2 - address    V3 - all        V4 - 3d         V5 - our        V6 - over
V7 - remove     V8 - internet   V9 - order      V10 - mail      V11 - receive   V12 - will
V13 - people    V14 - report    V15 - addresses V16 - free      V17 - business  V18 - email
V19 - you       V20 - credit    V21 - your      V22 - font      V23 - 000       V24 - money
V25 - hp        V26 - hpl       V27 - george    V28 - 650       V29 - lab       V30 - labs
V31 - telnet    V32 - 857       V33 - data      V34 - 415       V35 - 85        V36 - technology
V37 - 1999      V38 - parts     V39 - pm        V40 - direct    V41 - cs        V42 - meeting
V43 - original  V44 - project   V45 - re        V46 - edu       V47 - table     V48 - conference
V49 - ;         V50 - (         V51 - [         V52 - !         V53 - $         V54 - #
Partial likelihood test: Model 2 vs. Model 1
          Residual DF   Residual Deviance
Model 2   3055          2347.7
Model 1   3054          2344.1
DF = 1, Deviance = 3.597, Pr(>Chi) = 0.058

Partial likelihood test: Model 4 vs. Model 3
          Residual DF   Residual Deviance
Model 4   3053          2330.1
Model 3   3050          2328.7
DF = 3, Deviance = 1.431, Pr(>Chi) = 0.698

Partial likelihood test: Model 6 vs. Model 5
          Residual DF   Residual Deviance
Model 6   3055          2087.7
Model 5   3054          2086.6
DF = 1, Deviance = 1.104, Pr(>Chi) = 0.293

Partial likelihood test: Model 8 vs. Model 7
          Residual DF   Residual Deviance
Model 8   3053          2083.3
Model 7   3050          2082.2
DF = 3, Deviance = 1.010, Pr(>Chi) = 0.799

Model utility test (null vs. fitted model):
          Residual DF   Residual Deviance
Null      3066          4113.4
Fitted    3056          2411.7
DF = 10, Deviance = 1701.7, Pr(>Chi) < 2.2e-16

Model utility test (null vs. fitted model):
          Residual DF   Residual Deviance
Null      3066          4116.0
Fitted    3056          2184.7
DF = 10, Deviance = 1931.3, Pr(>Chi) < 2.2e-16
AIC and BIC for all candidate ARIMA models fitted to the spam and ham residuals:

Spam                Ham
AIC      BIC        AIC      BIC
2511.4   2523.0     2623.6   2636.2
2506.6   2522.1     2617.1   2633.9
2501.9   2521.3     —        —
2514.9   2526.6     2639.8   2652.4
2512.1   2527.6     2627.7   2644.5
2505.4   2524.8     2622.2   2643.2
2494.0   2509.5     2610.4   2627.2
2495.0   2514.4     2606.6   2627.6
2496.8   2520.1     2608.0   2633.3
2494.9   2514.3     2606.2   2627.3
2497.3   2520.6     2608.2   2633.5
2496.9   2524.0     2610.0   2639.5
2496.7   2520.0     —        —
2496.8   2524.0     —        —
2498.6   2529.6     —        —
2489.4   2497.1     2620.3   2628.8
2489.9   2501.5     2609.2   2621.8
2491.6   2507.1     2610.3   2627.2
2492.5   2511.9     2612.1   2633.2
Appendix B