
LINEAR STATISTICAL MODELS

SYS 4021

Project 2: Spam Filtering

Donald E. Brown

Laura Barnes

brown@virginia.edu

lb3dp@virginia.edu

Summary
In this study, I use logistic regression to build a static filter design, i.e., to classify e-mails as spam or ham. I find that three kinds of variables need to be considered in filtering out spam: the frequency of certain words/characters in the message, the longest run-length of capital letters, and the total run-length of capital letters. These variables are highly significant in the logistic regression model at the 5% level. It is important to transform the predictors onto a log scale, as this increases the accuracy of the model. The final model selected for spam filtering has the highest accuracy, with the smallest total error rate (13.4%) and false positive rate (7.9%). It also fits better based on the BIC criterion. For spam filtering, I also build a time series filter design to predict the daily amount of spam e-mail. I find that there is a relationship between the amount of spam e-mail received and the time of arrival: time is highly significant in the linear regression model at the 5% level. For the spam data, the residuals can be modeled by an ARMA model with 2 autoregressive (AR) terms and 1 moving average (MA) term, and this model gives the best forecast. Meanwhile, for the ham data, the residuals can be modeled by ARIMA(1,1,1), and this model gives the best forecast, with an MSE of 2.0. The ARIMA(1,1,1) model also has the lowest AIC and BIC values. Both models show adequacy in the Ljung-Box Q-statistic plot, since all the points are insignificant. The static and time series filter designs can be integrated into an overall filter design using Bayes rule: for any e-mail that comes into my classifier, the probability that it is spam is determined by combining the probability that it is spam according to the static filter with the probability that it is spam according to the time series filter.
Honor pledge: On my honor, I pledge that I am the sole author of this paper and I have accurately cited all
help and references used in its completion.

Imran A. Khan
November 10, 13

1. Problem Description
1.1. Situation
E-mail is among the most affordable, easiest, and fastest means of communication. For many e-mail users around the world, unsolicited junk mail, or spam, is one of the biggest problems facing the internet today. Spam often contains annoying advertisements for services or products, adult content, or other inappropriate material. It wastes people's time and internet bandwidth. The damage caused by spam has developed into a serious technical, economic, and social threat. The cost of spam in terms of lost productivity is estimated at about $21.58 billion annually, according to the 2004 National Technology Readiness Survey [7]. The survey reveals that internet users in the US spend about 3 minutes deleting spam e-mails every day. This comes to 22.9 million hours a week, or $21.58 billion based on an average wage across 169.4 million online adults in the US.
Spammers are dumping a lot on society and reaping fairly little in return, as noted by Rao and Reiley in their report "The Economics of Spam" [6]. With 94 billion spam messages sent daily, spam costs society around $20 billion while generating revenue of only about $200 million. Although the revenue from spamming is very low, it is also very cheap for spammers to send massive amounts of e-mail. They estimate that spammers need only 1 in 25,000 recipients to buy something in order to make a good profit. Comparing the cost of spam to other advertising media, as shown in Table 1, I can see that the cost of spam mailing (per thousand impressions) is the lowest. It is also very easy for spammers to break even with a marginal profit of $50.
Table 1. Cost of spam advertising relative to other advertising media (cost per thousand impressions)

Advertising vector          CPM            Percent        Breakeven conversions per 100,000
                                           deliveries     (marginal profit = $50.00)
Postal direct mail          $250-$1,000    2-10%          2000
Super Bowl advertising      $20            0.04           40
Online display advertising  $1-$5          0.002-0.006%   —
Retail spam                 $0.10-$0.50    0.001-0.002%   0.3
Botnet wholesale spam       $0.03          0.00%          0.06
Botnet via webmail          $0.05          0.00%          0.1

The number of spam e-mails sent every 24 hours has increased from year to year as the number of e-mail users has grown, causing more lost productivity every year. As shown in Figure 1, about 8% of all e-mails were identified as spam in 2001, and this rose by only about 1% in 2002. In 2003, however, there was a significant increase, with 40% of scanned e-mails identified as spam, and the rate continued to increase through 2010. I can see that the e-mail spam rate has dropped significantly in the last two years. According to the Symantec Intelligence Report issued in February 2012 [4], global spam levels have continued to fall and spam now accounts for 68% of global e-mail traffic. However, spam remains a major problem for many companies and individual e-mail users. Therefore, it is very important to classify e-mails as spam or non-spam so that the loss of productivity can be reduced.
Figure 1. Spam rate over time (2001-2012). The rate rises from 8.0% (2001) and 9.0% (2002) to 40.0% (2003), peaks at 89.1%, and falls back to 68.0% by 2012.

Many studies have been conducted to deal with spam. One way to filter out spam messages is by identifying word patterns or word frequencies, or by identifying certain symbols in the message. Based on an e-mail database collected at Hewlett-Packard Labs [1], [2], [5], the most common words and symbols used in spam messages are our, free, you, your, money, and $. As shown in Table 2, the frequencies of these words make a big difference in differentiating between wanted and unwanted e-mails, and they are much more likely to appear in spam messages.
Table 2. Average frequency of words/symbols appearing in spam and ham e-mails

Word/symbol   Spam   Ham
our           0.51   0.18
free          0.52   0.07
you           2.26   1.27
your          1.38   0.44
money         0.21   0.02
$             0.17   0.01

The average, longest, and total run-lengths of capital letters in a message can also be used to filter out spam. These three indicators show that spam tends to use more capital letters in the message (Table 3).

Table 3. The average, longest, and total run-length of capital letters in a message

          Spam    Ham
Average   9.5     2.4
Longest   104.4   18.2
Total     470.6   161.5

1.2. Goal
The purpose of this study is to build a static filter that determines whether a given e-mail is spam, based on the frequencies of certain words and characters and on the run-lengths of capital letters. It is also of interest to build a time series filter to detect e-mail spam.
1.3. Metrics
The spam variable, a categorical variable indicating whether a message is ham or spam, is used as the response variable to build the static filter by fitting logistic regression models. I use the AIC and BIC criteria to compare the performance of the fitted models, and a decision threshold of 0.5 to evaluate the models by the errors their decision functions make. In addition, ROC curves are plotted to evaluate the ability of the logistic models to distinguish between spam and ham e-mails.
The count variable, indicating the number of spam and ham e-mails received per day, is used to build the time series filter. I use the AIC and BIC criteria to assess the fitted time series models, and MSE to identify the model with the best forecast.
For both the static and time series analyses, I divide the data into two parts: a training set used for model building and a test set used to evaluate model performance. I use a significance level of 5% throughout this study: if the p-value is less than 0.05, the null hypothesis is rejected in favor of the alternative; if the p-value is greater than 0.05, the null hypothesis is not rejected.
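The threshold-0.5 scoring described above can be sketched as follows. The function and the example probabilities are illustrative, not the project's data:

```python
# Classify as spam when predicted probability exceeds 0.5, then tabulate the
# decisions into the error rates used throughout this report.
def score(probs, labels, threshold=0.5):
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p > threshold else 0   # 1 = spam, 0 = ham
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    n = tp + fp + tn + fn
    return {
        "total_error": (fp + fn) / n,
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }

result = score([0.9, 0.2, 0.6, 0.1, 0.8], [1, 0, 0, 0, 1])
print(result)
```

For these five messages the filter makes one false positive, giving a total error of 0.2, a true positive rate of 1.0, and a false positive rate of 1/3.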
1.4. Hypotheses
I have two hypotheses in this study: one for the static filter and one for the time series filter.
Hypothesis 1:
Ho: There is no relationship between word frequency or capital-letter run-lengths and spam
H1: There is a relationship between word frequency or capital-letter run-lengths and spam

Hypothesis 2:
Ho: There is no relationship between time of arrival and the amount of spam
H1: There is a relationship between time of arrival and the amount of spam

2. Approach
2.1. Data
Two datasets are used in this study. The first dataset, used to build the static filter, comes from Hewlett-Packard Labs and was created by Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. The spam data contains 4601 observations and 58 variables [2], [3], [5]. The first 57 variables are the predictors; they are continuous and indicate the frequency of certain words and characters in the e-mail. The last variable indicates whether a message is spam (1) or ham (0). A detailed description of the variables is given in Tables A1 and A2 (Appendix A). There are no missing values in the dataset. In total, 1813 (39.4%) e-mails are labeled as spam and 2788 (60.6%) as ham.
The second dataset, used to build the time series filters, consists of two e-mail time series databases, one for spam and one for ham [2], [3]. The spam database contains 364 observations of 4 variables collected in 2004-2005, and the ham database contains 506 observations of 4 variables collected in 2000-2001. The first 3 variables indicate the year, the month, and the day, and the last variable indicates the number of e-mails received. There are no missing values in either dataset.

2.2. Analysis
2.2.1. Static Analysis for Spam Filter Design

To build the spam filter model, I perform logistic regression analysis using R. The stages of the data analysis are as follows:
1. Divide the dataset into a training set (2/3) and a test set (1/3). The training set is used to build the logistic regression models and the test set is used to evaluate the performance of the fitted models.
2. Build the main effects model from a smaller set of predictor variables, excluding uninformative predictors. A variable is selected as a candidate predictor for the main model if it discriminates spam in the factor plot. Principal Component Analysis (PCA) is also used to select variables with high loadings in the first and second principal components.

3. Build an interaction model by including relevant interactions in the main effects model. This is done by performing a logistic regression with only two predictor variables, plus the interaction between them, to predict the outcome. The interaction plot displays the predicted probability of the response with the two predictors treated as factors (categorical variables). Each predictor is categorized into two levels:

   X*_ij = 1 if x_ij ≥ x̄_j, and X*_ij = 0 if x_ij < x̄_j, for i = 1, 2, …, n and j = 1, 2, …, p,

   where x̄_j is the mean of X_j, n is the number of observations, and p is the number of predictor variables. An interaction is selected when the estimated parameter of the two-way interaction is significant in the model and the interaction plot shows a crossover interaction, i.e., the two lines have opposite slopes or intersect.
4. Perform a stepwise selection procedure to select the important variables in the main effects model as well as in the interaction model. The reduced model is then examined with a partial likelihood test to check whether some of the predictors can be eliminated.
5. Fit the main effects and interaction models using a log scale for the predictors with an offset of 0.01, i.e., log(x + 0.01). Repeat step 4 to reduce the models with the log-transformed predictors.
6. Fit a Principal Component (PC) regression accounting for 98% of the variability, on the data both before and after the log transformation.
7. Compare the main effects, interaction, and PC regression models before and after transformation using the AIC and BIC criteria to select the best model.
8. Evaluate each model by computing the score table at the decision threshold of 0.5 and plot the ROC curves to show the performance of all models.
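The core of steps 1, 5, and 8 can be sketched on synthetic data. This is a plain-numpy stand-in for R's glm; the data-generating assumptions (exponential word frequencies, higher on average for spam) are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
spam = rng.integers(0, 2, n)                      # 0 = ham, 1 = spam
freq = rng.exponential(0.1 + 0.5 * spam)          # word frequency, higher for spam

# Step 5: log-transform the predictor with a 0.01 offset before fitting.
X = np.column_stack([np.ones(n), np.log(freq + 0.01)])

# Fit logistic regression by gradient ascent on the log-likelihood.
beta = np.zeros(2)
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    beta += 0.1 * X.T @ (spam - p) / n

# Step 8: classify at the 0.5 threshold and compute accuracy.
pred = (1.0 / (1.0 + np.exp(-X @ beta)) > 0.5).astype(int)
accuracy = (pred == spam).mean()
print(round(accuracy, 2))
```

The fitted coefficient on the log frequency is positive, as expected when the word is more frequent in spam.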
Figure 2 displays the biplot for V1-V57. I can observe that ham and spam e-mails have the greatest variance in the direction of the first and second principal components, respectively. Figure 3 displays the biplot for V1-V57 after the log transformation; it shows that spam e-mails have about equal variance in the first and second components. The two components of the log-transformed variables explain more variance (27.9%) than the two components of the untransformed variables (17.3%). The three predictors with the highest loadings in each of the first and second components are included in the main effects model. For PCA before transformation, V32, V34, and V40 have high loadings in the first component and V21, V56, and V23 in the second. For PCA after transformation, V53, V56, and V57 have high loadings in the first component and V31, V32, and V34 in the second.
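The PCA screening step can be sketched as follows, on synthetic data whose correlation structure is illustrative (four strongly correlated columns and two independent ones):

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=(200, 1))
# Four columns sharing one latent factor, plus two independent columns.
cols = [base + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)]
cols += [rng.normal(size=(200, 1)) for _ in range(2)]
X = np.column_stack(cols)

R = np.corrcoef(X, rowvar=False)          # PCA on the correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending eigenvalues
eigvals = eigvals[::-1]                   # sort descending
explained = eigvals / eigvals.sum()       # proportion of variance per component
print(round(explained[:2].sum(), 2))      # share explained by the first two PCs
```

The four correlated columns load almost entirely on the first component, so the first two components explain most of the variance, mirroring how the report uses high loadings to shortlist predictors.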

Figure 2. PCA plot on untransformed variables

Figure 3. PCA plot on transformed variables

Based on the factor plots, only 5 of the 57 variables discriminate spam. They are V3, V5, V12, V19, and V21, as shown in Figure 4; the factor plots for the rest of the predictors are shown in Figures B1 and B2 in Appendix B. The mean frequency of these variables is clearly higher for spam than for ham e-mails, and it is greater than zero.

Figure 4. Factor plots

The correlation matrix of the predictors selected from the PCA and factor plots is shown in Figure 5. The predictors show relatively weak relationships with each other; only a few have correlations greater than 0.5. It can be observed that V32 and V34 are perfectly correlated, so there is no need to include both in the model: they convey the same information, and dropping one avoids multicollinearity. Therefore, only V34 is retained in the model building.
Figure 5. Correlation matrix

In total, I have 12 potential predictors in my main effects model, which can be expressed in the following logistic regression model (Model 1):

log[P(spam = 1) / P(spam = 0)] = β0 + β3 V3 + β5 V5 + β12 V12 + β19 V19 + β21 V21 + β23 V23 + β31 V31 + β34 V34 + β40 V40 + β53 V53 + β56 V56 + β57 V57

It appears that the stepwise selection procedure retains the same main effects as Model 1, meaning no predictors are eliminated from the model. However, I find that V34 is insignificant in the model, so I remove this variable. The partial likelihood test confirms that V34 can be removed: the null hypothesis cannot be rejected, with p-value = 0.06 (Table A3, Appendix A), and thus the reduced model is preferable (Model 2).
Figure 6. Interaction plot

Based on the interaction plots, there are 4 interactions with crossover, as shown in Figure 6; the rest of the interaction plots can be found in Figure B3 in Appendix B. Including these interactions in the main effects model, my interaction model can be written as follows (Model 3):

log[P(spam = 1) / P(spam = 0)] = β0 + β3 V3 + β5 V5 + β12 V12 + β19 V19 + β21 V21 + β23 V23 + β31 V31 + β34 V34 + β40 V40 + β53 V53 + β56 V56 + β57 V57 + β3,40 V3 V40 + β23,34 V23 V34 + β23,40 V23 V40 + β34,40 V34 V40

Model 3 can be reduced by removing 3 of the interaction terms via the stepwise selection procedure. The partial likelihood test shows that the null hypothesis cannot be rejected, with p-value = 0.70 (Table A4, Appendix A), and thus the reduced interaction model is preferable (Model 4). Table 4 summarizes the estimated parameters and corresponding standard errors for the four models.

Table 4. Parameter estimates (standard errors) for logistic regression models with untransformed data

Parameter   Model 1:          Model 2:          Model 3:          Model 4:
            main effects      reduced main      interaction       reduced interaction
β0          -2.362 (0.100)    -2.372 (0.100)    -2.357 (0.100)    -2.359 (0.100)
V5           0.818 (0.090)     0.811 (0.090)     0.813 (0.090)     0.813 (0.090)
V12         -0.316 (0.074)    -0.310 (0.074)    -0.324 (0.075)    -0.323 (0.075)
V19          0.212 (0.028)     0.215 (0.028)     0.215 (0.028)     0.215 (0.028)
V21          0.402 (0.043)     0.398 (0.043)     0.399 (0.043)     0.398 (0.043)
V23          4.581 (0.648)     4.594 (0.649)     4.445 (0.648)     4.569 (0.649)
V31        -17.470 (3.963)   -17.690 (3.928)   -16.790 (4.029)   -17.650 (3.751)
V53          7.413 (0.727)     7.444 (0.729)     7.372 (0.723)     7.396 (0.724)
V56          0.016 (0.002)     0.016 (0.002)     0.016 (0.002)     0.016 (0.002)
AIC          2370.1            2371.7            2362.7            2358.1
BIC          2448.5            2444.1            2465.2            2442.5
*insignificant at 5% level
[The entries for V3, V34, V40, V57, and the interaction terms are illegible in the source.]

Using the same approach as for the models with untransformed variables, the main effects model with transformed variables can be reduced by dropping one predictor, V3. The partial likelihood test shows that the null hypothesis cannot be rejected, with p-value = 0.29 (Table A5, Appendix A), meaning the reduced main effects model performs better. Similarly, the interaction model with transformed variables can be reduced by dropping 3 interaction terms; the partial likelihood test gives an insignificant p-value (Table A6, Appendix A). Table 5 summarizes the estimated parameters and standard errors for the four models with transformed variables in the logistic regression equation.

Table 5. Parameter estimates (standard errors) for logistic regression models with transformed data

Parameter   Model 5:          Model 6:          Model 7:          Model 8:
            main effects      reduced main      interaction       reduced interaction
V12         -0.181 (0.028)    -0.178 (0.028)    -0.181 (0.028)    -0.181 (0.028)
V19          0.162 (0.028)     0.165 (0.028)     0.163 (0.028)     0.163 (0.028)
V21          0.203 (0.027)     0.206 (0.027)     0.205 (0.027)     0.205 (0.027)
V31         -1.074 (0.178)    -1.083 (0.178)    -1.040 (0.186)    -1.057 (0.184)
V53          0.656 (0.051)     0.656 (0.051)     0.652 (0.051)     0.653 (0.051)
V56          1.082 (0.084)     1.075 (0.084)     1.083 (0.084)     1.084 (0.084)
V57         -0.491 (0.080)    -0.473 (0.078)    -0.492 (0.080)    -0.494 (0.080)
AIC          2112.6            2111.7            2116.2            2111.3
BIC          2190.9            2184.0            2218.7            2195.7
*insignificant at 5% level
[The entries for β0, V5, V23, V34, V40, and the interaction terms are illegible in the source.]

PC regression is also fitted to the data before and after transformation, using the 12 predictors selected for the main effects model. Ten components explain 98% of the variation. The utility test shows that the null hypothesis is rejected, meaning the 10 components cannot be removed from the model (Tables A7 and A8, Appendix A). The results are shown in Table 6.

Table 6. Parameter estimates and standard errors for PC regression models

              Model 9: Untransformed data      Model 10: Transformed data
              Estimate      Std. Error         Estimate      Std. Error
(Intercept)    0.159        0.089              -0.829        0.082
Comp.1         1.403        0.150              -1.169        0.045
Comp.2        -2.977        0.174              -1.017        0.155
Comp.3         1.080        0.141               0.406        0.055
Comp.4         0.637        0.192              -0.694        0.066
Comp.5        -0.397        0.072               0.708        0.068
Comp.6         0.210        0.114              -0.022        0.059
Comp.7        -0.189        0.061              -0.327        0.065
Comp.8         0.004        0.102              -0.204        0.079
Comp.9         0.182        0.177               0.507        0.102
Comp.10       -2.279        0.274               0.552        0.124
AIC            2433.7                           2206.7
BIC            2500.0                           2273.0
*insignificant at 5% level

2.2.2. Time Series Analysis for Spam Filter Design

To build the time series filter design, I perform time series analysis using R. The stages of the data analysis are as follows:
1. Plot the periodogram to discover the periodic components of the time series.
2. Model the trend and seasonality components of spam and ham using a linear regression model with the number of e-mails received per day as the response variable. The model building process uses all the data except the last 7 days; the last week of data is used for forecasting.
3. Obtain the residuals from the fitted regression model and check whether they show correlation. If the residuals appear non-stationary, the first differences of the residuals are considered.
4. Examine the ACF and PACF plots to estimate the order p of the AR component and the order q of the MA component.
5. Select several candidate models using the AIC and BIC criteria, and use diagnostic plots to check model adequacy. MSE is also used to evaluate the forecasting performance of the models.
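Steps 2-4 can be sketched on synthetic data. The counts, trend, and noise level below are illustrative stand-ins, not the project's data:

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(120)
count = 24 + 0.02 * t + rng.normal(0, 3, 120)    # daily counts with a slow trend

# Step 2: hold out the last 7 days, fit the trend by least squares.
train_t, train_y = t[:-7], count[:-7]
A = np.column_stack([np.ones(len(train_t)), train_t])
coef, *_ = np.linalg.lstsq(A, train_y, rcond=None)

# Step 3: residuals from the regression fit, with a lag-1 autocorrelation check.
resid = train_y - A @ coef
r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

# Trend-only forecast of the held-out week, scored by MSE (as in step 5).
forecast = coef[0] + coef[1] * t[-7:]
mse = np.mean((count[-7:] - forecast) ** 2)
print(round(coef[1], 3), round(r1, 3), round(mse, 2))
```

With independent noise, r1 stays near zero; in the report's data the residuals are correlated, which is why an ARMA/ARIMA correction is fitted to them.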
Figure 7 displays the number of spam and ham e-mails received per day. The number of spam e-mails mostly ranges from 10 to 40 per day. The series moves slowly up and down around an average of 27 spam e-mails per day. Although the number of spam e-mails has dropped a few times recently, it trends slightly upward, with some significant peaks at 40 to 60 spam e-mails. Meanwhile, the majority of ham e-mails fluctuate between 0 and 10 per day, with an average of about 4 daily ham e-mails. The series tends to wander up and down around the average for a while and then drop to zero.
Figure 7 also displays the periodograms for the spam and ham data. The spam periodogram shows no obvious seasonal component; it reaches its highest peak at the lowest frequency, i.e., at a period of 375 days. The ham periodogram shows a weekly seasonal component, with its highest peak at frequency 0.15, i.e., a period of about 6.7 days.
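The periodogram step can be sketched with a synthetic weekly series; the period and amplitude below are illustrative:

```python
import numpy as np

n = 364
t = np.arange(n)
y = 4 + 3 * np.sin(2 * np.pi * t / 7)     # weekly cycle around a mean of 4
y = y - y.mean()                          # remove the mean (the zero frequency)

power = np.abs(np.fft.rfft(y)) ** 2       # periodogram via the FFT
freqs = np.fft.rfftfreq(n)                # frequencies in cycles per day
peak = freqs[np.argmax(power[1:]) + 1]    # skip the zero-frequency bin
print(round(1 / peak, 1))                 # dominant period in days -> 7.0
```

The peak lands exactly at frequency 1/7 here because 364 is a multiple of 7; with real data the peak falls in the nearest frequency bin, which is why the ham periodogram reports a period of about 6.7 days rather than exactly 7.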

Figure 7. Periodogram for spam and ham data

The trend and daily seasonality components of the number of spam and ham e-mails received per day are modeled by the following regression model:

y_i = β0 + β1 t_i + β2 Sun_i + β3 Mon_i + β4 Tue_i + β5 Wed_i + β6 Thu_i + β7 Fri_i + ε_i

where y_i is the number of spam/ham e-mails on day i; t_i is time (day); and Sun, Mon, Tue, Wed, Thu, and Fri are indicators for daily seasonality, with Saturday as the base case.

I find that time is a significant predictor of the number of spam e-mails received per day. The t-test shows that β1 is nonzero, with p-value < 0.001 (Table 7). This means that time is an important factor for the amount of spam, with a very slowly increasing trend over time. It also appears that daily seasonality is not an important component of the spam regression model: the partial F-test shows that the seasonal terms can be removed, with F-statistic = 0.813 and p-value = 0.56. Thus, the spam time series can be modeled using the trend alone, which confirms the periodogram, where no obvious seasonality is observed for the spam data. For the ham time series, seasonality is an important component of the regression model and cannot be dropped. The estimated regression parameters and standard errors for the spam and ham regression models are given in Table 7. Both models are significant overall at the 5% level. Since regression models for time series frequently have correlated residuals, the error term ε_i is then corrected by a time series model.
Table 7. Estimated parameters and standard errors for the spam and ham linear regression models

Parameter        Spam Regression Model         Ham Regression Model
                 Estimate     Std. Error       Estimate     Std. Error
β0 (intercept)   23.997       0.881             0.570       0.508
β1 (time)         0.019       0.004            -0.001       0.001
β2 (Sun)                                        0.240       0.603
β3 (Mon)                                        4.495       0.603
β4 (Tue)                                        5.510       0.603
β5 (Wed)                                        5.286       0.603
β6 (Thu)                                        5.136       0.601
β7 (Fri)                                        4.707       0.601
R²                5.3%                          28.3%
Adjusted R²       5.1%                          27.3%
Overall model    F-statistic: 19.99 on 1 and   F-statistic: 27.73 on 7 and
                 335 DF, p-value: 1.05e-05     491 DF, p-value: < 2.2e-16

The histogram of the spam residuals is approximately normal, while the histogram of the ham residuals is skewed to the right. The run sequence plots of the residuals from the linear regression fits show a non-constant mean: the mean appears to change over time. Thus, the regression residuals for both spam and ham are considered non-stationary, and it may be necessary to take the first differences of the residuals (Figure 8).

Figure 8. Residuals plot

However, I can still consider several potential ARIMA models by looking at the ACF and PACF plots of the residuals before taking any differences. I can observe that the sample ACF decays relatively slowly and that the sample PACF has no significant peaks after lag 3. Thus, AR(3) is one candidate model. I also consider ARMA models that include both AR and MA terms. Although the ARMA lags cannot be selected solely from the ACF and PACF, it seems that no more than 3 AR and 3 MA terms are needed for the spam residuals, and no more than 2 AR and 3 MA terms for the ham residuals. To identify the best lags, I fit several models with all combinations of p and q, and I also use an automated procedure to find the best ARIMA model. For the spam residuals, ARIMA(1,0,1) is the best-fitting model based on the AIC/BIC criteria and the automated procedure, and ARIMA(2,0,1) is the second best based on AIC/BIC. For the ham residuals, ARIMA(2,0,1), ARIMA(1,0,1), and ARIMA(1,0,2) are the best-fitting models based on AIC, BIC, and the automated procedure, respectively. Overall, I find that ARIMA(0,1,1) and ARIMA(1,1,1) are the best-fitting models for the spam and ham residuals, respectively, based on the AIC and BIC criteria. The AIC and BIC values for all potential models are summarized in Table A9 (Appendix A).
After taking the first differences of the residuals, the run sequence plots indicate that the data have a constant location and variance, meaning the resulting series appear stationary for both the spam and ham residuals (Figure 9). The ACF plot shows that the autocorrelation at lag 1 exceeds the significance bound, but the autocorrelations at lags 2-25 do not. Since the ACF cuts off after lag 1 and the PACF decays more slowly, I consider ARIMA(0,1,1) as another candidate model. I also add MA terms for the differenced series, since there is a significant spike at lag 3 in the PACF for the spam residuals; the same number of MA terms is considered for the ham residuals, since it is not necessary to take into account lags at high orders. After fitting all combinations of p and q for the first differences, I find that ARIMA(0,1,1) and ARIMA(1,1,1) are the best-fitting models for the spam and ham residuals, respectively, based on the AIC and BIC criteria (Table A9, Appendix A).
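The forecast comparison by MSE can be sketched as follows; the held-out counts and the two candidate forecasts below are illustrative, not the fitted ARIMA outputs:

```python
import numpy as np

actual = np.array([3, 5, 4, 6, 2, 4, 5], dtype=float)    # held-out week of counts
forecast_a = np.full(7, 4.0)                             # e.g. a flat mean-level forecast
forecast_b = np.array([3.5, 4.5, 4.0, 5.0, 3.0, 4.0, 4.5])  # a model tracking the swings

def mse(forecast):
    """Mean squared error of a 7-day forecast against the held-out week."""
    return np.mean((actual - forecast) ** 2)

print(mse(forecast_a), mse(forecast_b))   # the smaller MSE wins
```

Here forecast_b tracks the day-to-day swings and wins on MSE, which is the sense in which ARIMA(1,1,1) "gives the best forecast" for the ham residuals above.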
Figure 9. The first difference of residuals plot

2.2.3. Integrated Filter Design

The integrated spam filter design can be formulated by Bayes rule:

P(A|B) = P(B|A) P(A) / P(B)

It expresses the conditional probability of an event A occurring given some evidence B. This equation can be used to derive the mathematical model that integrates the static and time series filters. The purpose is to calculate the probability of spam given two pieces of evidence: the static filter's decision and the forecast amount of spam at the time of arrival:

P(E = 1 | S = j, T = k) = P(S = j, T = k | E = 1) P(E = 1) / Σ_{i=0}^{1} P(S = j, T = k | E = i) P(E = i)

where:

E is the event that an e-mail is spam (1) or non-spam (0);

S is the static filter's indication that the message is spam (1) or non-spam (0);

T is the time series filter's indication that the message is spam (1) or non-spam (0);

i, j, and k can only be 0 (non-spam) or 1 (spam).

Assuming S and T are conditionally independent given E, P(S = 1, T = 1 | E = 1) can be decomposed into two terms:

P(S = 1, T = 1 | E = 1) = P(S = 1 | E = 1) P(T = 1 | E = 1)

The first term is the probability that the static filter gives positive evidence (spam) given that my e-mail is spam. This is calculated as the true positive rate, i.e., TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives. The second term is the probability that the time series filter gives positive evidence (spam) given that my e-mail is spam. This is calculated as the number of spam e-mails received at time t divided by the total number of e-mails (spam and ham) received at time t; in other words, P(T = 1 | E = 1) = #spam_t / (#spam_t + #ham_t). Meanwhile, P(E = 1) is the prior probability that a message is spam. Using the data from Hewlett-Packard Labs, I can calculate that the probability of a spam message is 0.4.
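The integrated posterior can then be computed as a short piece of arithmetic. All rates below are illustrative stand-ins except the 0.4 prior taken from the Hewlett-Packard data; the ham-conditional rates are my assumptions, needed for the denominator:

```python
# Bayes-rule combination of the two filters, assuming S and T are
# conditionally independent given E.
p_spam = 0.4               # P(E = 1), from the HP data
p_s_given_spam = 0.78      # static filter true positive rate (illustrative)
p_s_given_ham = 0.08       # static filter false positive rate (illustrative)
p_t_given_spam = 0.70      # share of day-t mail that is spam (illustrative)
p_t_given_ham = 0.30       # assumed complement behavior for ham (illustrative)

num = p_s_given_spam * p_t_given_spam * p_spam
den = num + p_s_given_ham * p_t_given_ham * (1 - p_spam)
posterior = num / den      # P(E = 1 | S = 1, T = 1)
print(round(posterior, 3))
```

When both filters point to spam, even a 0.4 prior is pushed above 0.9, which is the intended effect of combining the two sources of evidence.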

3. Evidence
3.1. Static Filter Design

Reducing the number of predictors in a logistic regression model can improve the fitted model, as reflected in smaller AIC and BIC values than the full model. This is not the case for the main effects model with untransformed data, however, since the AIC value is smaller for the full model (Model 1). For the untransformed data, the reduced interaction model fits best, with the smallest AIC (2358.1) and BIC (2442.5) values. For the transformed data, the reduced interaction model (Model 8) fits best based on AIC, and the reduced main effects model fits best based on BIC. In general, the logistic regression models with transformed variables perform better, because their AIC and BIC values are smaller than those of the models with untransformed variables (Table 8).


Table 8. Model assessment

                                              AIC       BIC
Untransformed data
  Model 1: The main effects model             2370.1    2448.5
  Model 2: The reduced main effects model     2371.7    2444.1
  Model 3: The interaction model              2362.7    2465.2
  Model 4: The reduced interaction model      2358.1    2442.5
  Model 9: PC regression                      2433.7    2500.0
Transformed data
  Model 5: The main effects model             2112.6    2190.9
  Model 6: The reduced main effects model     2111.7    2184.0
  Model 7: The interaction model              2116.2    2218.7
  Model 8: The reduced interaction model      2111.3    2195.7
  Model 10: PC regression                     2206.7    2273.0

This evidence is also supported by the total number of errors made by the decision functions of the fitted models. The models with untransformed variables make about 3% more errors than the models with transformed variables. It appears that the transformation produces slightly more false positives but fewer false negatives (Table 9). In addition, the comparison of ROC curves indicates that applying the transformation increases the accuracy in distinguishing spam from non-spam e-mails: the transformed models discriminate better, since the red curve lies above the blue curve and closer to the upper left corner (Figure 10).

Table 9. Score table

                          Count                                      %
Model        Total   True   True   False  False     Total   TP      TN      FP      FN
             error   pos.   neg.   pos.   neg.      error   rate    rate    rate    rate
Untransformed data
  Model 1    252     424    858    72     180       16.4%   70.2%   92.3%   7.7%    29.8%
  Model 2    252     423    859    71     181       16.4%   70.0%   92.4%   7.6%    30.0%
  Model 3    253     423    858    72     181       16.5%   70.0%   92.3%   7.7%    30.0%
  Model 4    253     423    858    72     181       16.5%   70.0%   92.3%   7.7%    30.0%
  Model 9    256     419    859    71     185       16.7%   69.4%   92.4%   7.6%    30.6%
Transformed data
  Model 5    208     470    856    77     131       13.6%   78.2%   91.7%   8.3%    21.8%
  Model 6    206     469    859    74     132       13.4%   78.0%   92.1%   7.9%    22.0%
  Model 7    209     469    856    77     132       13.6%   78.0%   91.7%   8.3%    22.0%
  Model 8    208     470    856    77     131       13.6%   78.2%   91.7%   8.3%    21.8%
  Model 10   230     467    837    96     134       15.0%   77.7%   89.7%   10.3%   22.3%


Figure 10. ROC curve SPAM Filter

Figure 11. ROC curves for models with untransformed and transformed data

There are two best candidate models with transformed variables based on AIC and BIC: the reduced main effects model and the reduced interaction model. The ROC curves for the two models overlap, meaning they have similar accuracy (Figure 11). Looking at the total errors, the main effects model makes fewer errors than the interaction model. Also, since a false positive is more serious than a false negative in spam filtering, the main effects model is superior to the interaction model because it makes fewer false positives. In addition, the partial likelihood test indicates that the interaction model (Model 8) can be simplified by removing the interaction term (Model 6): the p-value is 0.111, which is greater than 0.05, so I cannot reject the null hypothesis, and Model 6 (the reduced main effects model) performs better.


Table 10. Partial likelihood test between model 6 and model 8

Model   | Residual DF | Residual deviance | DF | Deviance | Pr(>Chi)
Model 6 | 3055        | 2087.7            |    |          |
Model 8 | 3053        | 2083.3            | 2  | 4.4      | 0.111
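The p-value in the partial likelihood test comes from referring the deviance drop to a chi-square distribution with degrees of freedom equal to the number of extra parameters. For 2 degrees of freedom the chi-square survival function reduces to exp(-x/2), so a sketch needs only the standard library:

```python
import math

dev_drop = 2087.7 - 2083.3   # deviance explained by the interaction terms
df = 3055 - 3053             # extra parameters in the interaction model
# For a chi-square with 2 degrees of freedom, P(X > x) = exp(-x/2).
p_value = math.exp(-dev_drop / 2)
print(round(dev_drop, 1), df, round(p_value, 3))  # 4.4 2 0.111
```

Since 0.111 > 0.05, the simpler model is retained, matching the conclusion drawn from Table 10.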

For efficiency, the main effects model is also preferable because it estimates fewer parameters. Thus, the reduced main effects model with 11 log-transformed predictor variables is selected as the best model for spam filter design, and it can be written as follows:

log[ P(Y = 1) / P(Y = 0) ] = -3.248 + 0.269 X5 - 0.178 X12 + 0.165 X19 + 0.206 X21 + 0.433 X23 - 1.083 X31 - 0.717 X34 + 0.228 X40 + 0.656 X53 + 1.075 X56 - 0.473 X57

where Xi = log(Vi + 0.01).


In my final model, all 11 predictors are highly significant at the 5% level. The first 9 predictors, i.e. V5, V12, V19, V21, V23, V31, V34, V40, and V53, measure the frequency of certain words or symbols in the message. The last 2 predictors, V56 and V57, measure the longest uninterrupted sequence of capital letters and the total number of capital letters in a message. Rejecting the null hypothesis for these coefficients indicates a relationship between word frequency and capital-letter usage and whether a message is spam.
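Applying the fitted model to a new message amounts to evaluating the logit and passing it through the inverse-logit. A minimal sketch: the coefficient magnitudes are the reported ones, the signs of the unsigned terms are my reconstruction of the garbled equation, and the feature values below are made up for illustration:

```python
import math

intercept = -3.248
coef = {  # reduced main effects model; each predictor enters as log(V + 0.01)
    "V5": 0.269, "V12": -0.178, "V19": 0.165, "V21": 0.206, "V23": 0.433,
    "V31": -1.083, "V34": -0.717, "V40": 0.228, "V53": 0.656,
    "V56": 1.075, "V57": -0.473,
}

def p_spam(v):
    """Probability that an e-mail is spam given raw feature values V5..V57."""
    z = intercept + sum(b * math.log(v[name] + 0.01) for name, b in coef.items())
    return 1 / (1 + math.exp(-z))

# Hypothetical message: some '$' characters and a long run of capital letters.
msg = {"V5": 0.5, "V12": 0.0, "V19": 1.2, "V21": 0.8, "V23": 0.3,
       "V31": 0.0, "V34": 0.0, "V40": 0.0, "V53": 0.4, "V56": 40, "V57": 120}
print(round(p_spam(msg), 3))
```

Because the coefficient on V53 ('$' frequency) is positive, increasing that feature always raises the predicted spam probability.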

3.2. Time Series Filter Design


Time of arrival is an important variable in predicting the amount of spam e-mail received per day. The t-test statistic shows that the estimated parameter for time in the regression model is highly significant at the 5% level (t-stat = 4.47, p-value < 0.0001). This means that time cannot be eliminated from the model. Since time series regression models frequently have correlated residuals, it is important to model them, and the ACF plots confirm that the regression residuals for spam and ham are indeed correlated.
Since the run sequence of the residuals appears to have a non-constant mean, I also consider the first differences of the residuals. This leads to several candidate models for the residuals. I use AIC, BIC, and an automated selection procedure to obtain the best candidate models, and I use the MSE on the last 7 days of the data, held out as a test set, to evaluate which model forecasts best.
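The holdout evaluation is straightforward: fit on all but the last 7 days, forecast those 7 days, and score with MSE. A sketch with a synthetic daily series standing in for the real counts, and a naive flat forecast standing in for any candidate ARIMA forecast:

```python
def mse(actual, predicted):
    """Mean squared forecast error over the holdout window."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

series = [40, 42, 45, 47, 50, 49, 53, 55, 54, 58, 60, 59, 63, 64]  # synthetic
train, test = series[:-7], series[-7:]   # last 7 days held out

# A naive "flat" forecast repeats the last training value; an ARIMA
# forecast would be scored against the same 7 holdout days.
flat = [train[-1]] * 7
print(round(mse(test, flat), 1))  # 48.0
```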


Table 11. Model assessment

Model        | AIC    | BIC    | MSE

Spam residuals model
ARIMA(1,0,1) | 2494.0 | 2509.5 | 704.0
ARIMA(2,0,1) | 2494.9 | 2514.3 | 703.5
ARIMA(0,1,1) | 2489.4 | 2497.1 | 713.2

Ham residuals model
ARIMA(1,0,1) | 2610.4 | 2627.2 | 10.3
ARIMA(1,0,2) | 2606.6 | 2627.6 | 8.2
ARIMA(2,0,1) | 2606.2 | 2627.3 | 8.0
ARIMA(1,1,1) | 2609.2 | 2621.8 | 2.0

The 3 best fitted models for the spam residuals and the 4 best fitted models for the ham residuals are summarized in Table 11, together with the MSE scores used to evaluate forecast quality. For the spam residuals, the ARIMA(2,0,1) model obtained from the automated selection procedure produces the lowest MSE and therefore the best forecast among the candidates. For the ham residuals, the ARIMA models sometimes predict negative values for the amount of e-mail received, even though this variable can only take nonnegative values. This is not a desirable feature of the predictive model, so I truncate negative predictions to zero to keep the estimated daily e-mail counts sensible. With this adjustment, ARIMA(1,1,1) provides the best forecast for the ham residuals (MSE = 2.0) and also has the lowest BIC (2621.8) among the candidates.
Based on the diagnostic plots displayed in Figures 12 and 13, the standardized residuals do not violate the assumption of constant location and scale; most residuals fall in the range (-2, 2). The ACF plots show no significant spikes through lag 25, indicating that the model residuals are not autocorrelated. The Ljung-Box test indicates no non-zero autocorrelation among the first 10 lags, except for the ARIMA(1,0,1) model of the ham residuals. This shows that all the selected models are adequate except ARIMA(1,0,1).
The predicted amounts of e-mail received over the last 7 days, based on ARIMA(2,0,1) for spam and ARIMA(1,1,1) for ham, are shown in Figure 14. ARIMA(2,0,1) produces a nearly constant prediction over the last 7 days, and ARIMA(1,1,1) overestimates the amount of ham e-mail, especially from day 3 to day 6. The residual plots of these two ARIMA models show no deviation from the normality assumption (Figure 15).
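The Ljung-Box Q-statistic behind those diagnostic panels is simple to compute from the residual autocorrelations. A sketch on a toy residual series, with values chosen so the arithmetic can be checked by hand:

```python
def acf(x, k):
    """Sample autocorrelation of x at lag k (about a zero mean, as for residuals)."""
    n = len(x)
    return sum(x[t] * x[t + k] for t in range(n - k)) / sum(v * v for v in x)

def ljung_box_q(x, max_lag):
    """Ljung-Box Q over lags 1..max_lag; compared to a chi-square to test adequacy."""
    n = len(x)
    return n * (n + 2) * sum(acf(x, k) ** 2 / (n - k) for k in range(1, max_lag + 1))

resid = [1.0, -1.0, 1.0, -1.0]          # strongly alternating, so r1 = -0.75
print(ljung_box_q(resid, max_lag=1))    # 4 * 6 * 0.75**2 / 3 = 4.5
```

Large Q values (significant points in the Q-statistic plot) signal leftover autocorrelation; an adequate model, like the ones selected here, shows only insignificant points.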


Figure 12. Diagnostic plots for the spam residual models: (a) ARIMA(1,0,1); (b) ARIMA(2,0,1); (c) ARIMA(0,1,1)


Figure 13. Diagnostic plots for the ham residual models: (a) ARIMA(1,0,1); (b) ARIMA(1,0,2); (c) ARIMA(2,0,1); (d) ARIMA(1,1,1)


Figure 14. Forecasting the amount of e-mails received in the last 7 days

Figure 15. Residuals plot of ARIMA model selected


4. Recommendation
For the static filter design, it is important to transform the variables to a log scale, as this increases the accuracy of the logistic regression model in distinguishing spam from ham e-mails. Three important factors need to be considered in filtering out spam: the frequency of certain words/characters, the longest run of capital letters, and the total number of capital letters. These factors are highly significant in the model at the 5% level. The logistic regression model using these log-scale factors as main effects produces the highest accuracy, with the smallest total error rate (13.4%) and false positive rate (7.9%) among the candidates. It also fits best by the BIC criterion (2184.0) and performs well on the ROC curve.
For the time series filter design, I find that the daily amount of spam e-mail can be modeled as a time series with trend, and there is no need for a seasonal component in the linear regression model. Time of arrival is highly significant in the regression model at the 5% level, with an increasing trend. The spam residuals can be modeled by an ARMA model with 2 autoregressive (AR) terms and 1 moving average (MA) term, and this model gives the best forecast (MSE 703.5; Table 11). Meanwhile, the ham residuals can be modeled by ARIMA(1,1,1), which gives the best forecast (MSE 2.0) and has the lowest BIC among the candidate models.
Given the evidence from the static and time series filter designs, I can calculate the probability that a received e-mail is spam by integrating the two filters using Bayes' rule. Three terms are needed: P(S = 1), P(F = 1 | S = 1), and P(S = 1 | t). The first is the probability of spam e-mails, the second is the true positive rate of the static filter, and the third is the probability of receiving spam e-mail at time t.


5. References
[1] A. Zanni and I. S. Perez, "Spam Filtering," January 2012, UPC.
[2] D. E. Brown and L. Barnes, "Project 2: Spam Filters," October 2013, assignment in class SYS 4021.
[3] D. E. Brown and L. Barnes, "Project 2: Spam Filters template," October 2013, assignment in class SYS 4021.
[4] J. Pimanova, "Email Spam Trends at a Glance: 2001-2012," [online] http://www.emailtray.com/blog/email-spam-trends-2001-2012/ (accessed 11/08/2013).
[5] UCI Machine Learning Repository, "Spambase Dataset," http://archive.ics.uci.edu/ml/datasets/Spambase (accessed 11/04/2013).
[6] J. M. Rao and D. H. Reiley, "The Economics of Spam," Journal of Economic Perspectives, 2012, [online] http://www.aeaweb.org/articles.php?doi=10.1257/jep.26.3.87 (accessed 11/08/2013).
[7] T. Claburn, "Spam Costs Billions," February 2005, [online] http://www.informationweek.com/spam-costs-billions/59300834 (accessed 11/04/2013).


Appendix A

Table A 1. Spam filtering description

Variable  | Description
V1 - V48  | 48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times WORD appears in the e-mail) / (total number of words in the e-mail). A "word" is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
V49 - V54 | 6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / (total characters in the e-mail).
V55       | continuous real [1,...]: average length of uninterrupted sequences of capital letters.
V56       | continuous integer [1,...]: longest uninterrupted sequence of capital letters.
V57       | continuous integer [1,...]: total number of capital letters in the e-mail.
Class     | categorical {0,1}: class label for spam (1) or ham (0).
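The attribute definitions above are easy to reproduce from raw text. A sketch with my own helper names (not from the Spambase distribution) for the word-frequency and capital-run attributes:

```python
import re

def word_freq(text, word):
    """100 * (occurrences of `word`) / (total alphanumeric words), as in Table A 1."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return 100 * sum(w.lower() == word for w in words) / len(words)

def capital_run_features(text):
    """Longest uninterrupted capital run (V56) and total capital letters (V57)."""
    runs = [len(m.group()) for m in re.finditer(r"[A-Z]+", text)]
    return max(runs, default=0), sum(runs)

msg = "Get your FREE money now, totally FREE"
print(round(word_freq(msg, "free"), 1))   # 2 of 7 words -> 28.6
print(capital_run_features(msg))          # longest run 4, 9 capitals in total
```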

Table A 2. Words and symbols present in the input data

V1 - make     | V2 - address   | V3 - all       | V4 - 3d     | V5 - our      | V6 - over
V7 - remove   | V8 - internet  | V9 - order     | V10 - mail  | V11 - receive | V12 - will
V13 - people  | V14 - report   | V15 - addresses| V16 - free  | V17 - business| V18 - email
V19 - you     | V20 - credit   | V21 - your     | V22 - font  | V23 - 0       | V24 - money
V25 - hp      | V26 - hpl      | V27 - george   | V28 - 650   | V29 - lab     | V30 - labs
V31 - telnet  | V32 - 857      | V33 - data     | V34 - 415   | V35 - 85      | V36 - technology
V37 - 1999    | V38 - parts    | V39 - pm       | V40 - direct| V41 - cs      | V42 - meeting
V43 - original| V44 - project  | V45 - re       | V46 - edu   | V47 - table   | V48 - conference
V49 - ;       | V50 - (        | V51 - [        | V52 - !     | V53 - $       | V54 - #


Table A 3. Partial likelihood test between model 1 and model 2

Model   | Residual DF | Residual deviance | DF | Deviance | Pr(>Chi)
Model 2 | 3055        | 2347.7            |    |          |
Model 1 | 3054        | 2344.1            | 1  | 3.597    | 0.058

Table A 4. Partial likelihood test between model 3 and model 4

Model   | Residual DF | Residual deviance | DF | Deviance | Pr(>Chi)
Model 4 | 3053        | 2330.1            |    |          |
Model 3 | 3050        | 2328.7            | 3  | 1.431    | 0.698

Table A 5. Partial likelihood test between model 5 and model 6

Model   | Residual DF | Residual deviance | DF | Deviance | Pr(>Chi)
Model 6 | 3055        | 2087.7            |    |          |
Model 5 | 3054        | 2086.6            | 1  | 1.104    | 0.293

Table A 6. Partial likelihood test between model 7 and model 8

Model   | Residual DF | Residual deviance | DF | Deviance | Pr(>Chi)
Model 8 | 3053        | 2083.3            |    |          |
Model 7 | 3050        | 2082.2            | 3  | 1.010    | 0.799

Table A 7. Model utility test for PC regression with untransformed variables

Model                                  | Residual DF | Residual deviance | DF | Deviance | Pr(>Chi)
Model with intercept only              | 3066        | 4113.4            |    |          |
PC regression, untransformed variables | 3056        | 2411.7            | 10 | 1701.7   | < 2.2e-16

Table A 8. Model utility test for PC regression with transformed variables

Model                                  | Residual DF | Residual deviance | DF | Deviance | Pr(>Chi)
Model with intercept only              | 3066        | 4116              |    |          |
PC regression, transformed variables   | 3056        | 2184.7            | 10 | 1931.3   | < 2.2e-16


Table A 9. AIC and BIC for each fitted ARIMA model

[Only partially recoverable from the extracted text; the rows below are those identifiable via Table 11.]

Model        | Spam AIC | Spam BIC | Ham AIC | Ham BIC
ARIMA(1,0,1) | 2494.0   | 2509.5   | 2610.4  | 2627.2
ARIMA(1,0,2) | 2495.0   | 2514.4   | 2606.6  | 2627.6
ARIMA(2,0,1) | 2494.9   | 2514.3   | 2606.2  | 2627.3
ARIMA(0,1,1) | 2489.4   | 2497.1   | 2620.3  | 2628.8
ARIMA(1,1,1) | 2489.9   | 2501.5   | 2609.2  | 2621.8


Appendix B

Figure B 1. Factor plots on word and character frequencies


Figure B 2. Factor plots on sequences on capital letters in a message


Figure B 3. Interaction plot

