
CS 539 Introduction to Artificial Neural Networks and Fuzzy Systems Spring 2016

Detecting Spam with Artificial Neural Networks


Andrew Edstrom
University of Wisconsin - Madison

Abstract

This is my final project for CS 539. In this project, I demonstrate the suitability of

neural networks for the task of classifying spam emails. I discuss how I was able to attain a

classification accuracy of 94.6% through minor changes in network configuration and the

momentum alpha parameter, ultimately outperforming existing research on this same

dataset.

Keywords: Artificial Intelligence, Machine Learning, Neural Networks, Spam

Detection

I. Introduction

Neural networks are powerful tools for any machine learning task that involves

classification. They are utilized in a wide range of applications including recommendation engines,

computer vision, and dashboard customization. Because of their versatility, they are emerging as one

of the primary tools in the machine learning professional's toolkit.

However, neural networks are not as widely used in spam email classification as one might

expect. Instead, most modern spam filters employ naïve Bayes classifiers, due in large part to Paul Graham's famed article "A Plan for Spam." Naïve Bayes is a great approach for spam classification

with high accuracy and a low false-positive rate, but by itself it may not be enough to achieve the

99.99+% accuracy which we would like to see.
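For context, the core of the approach Graham describes assigns each token in a message an estimated spam probability and combines the probabilities p_1, ..., p_n of the most interesting tokens into a single score, roughly as follows (restated here in conventional notation):

\[
P(\mathrm{spam}) = \frac{\prod_{i=1}^{n} p_i}{\prod_{i=1}^{n} p_i + \prod_{i=1}^{n} (1 - p_i)}
\]

A message is then flagged as spam only when this combined score exceeds a high threshold, which helps keep false positives rare.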



Google reported that introducing neural networks into Gmail's spam filters took them from

99.5% to 99.9% accuracy, suggesting that neural networks may be useful for enhancing spam filters,

especially when used in conjunction with Bayesian classification and other methodologies. However,

there is not much research on the use of neural networks for spam detection, and most of the

existing research holds the network configuration, momentum, and learning rate constant,

investigating the effectiveness of the network across datasets rather than the suitability of different

network configurations for the task. In my project, I have done the opposite, holding the dataset

constant while adjusting the network configuration and parameters, in order to find the ideal

network configuration for spam classification.

II. Work Performed

Because I wanted to focus on network configuration rather than the dataset preparation, I

chose to use the UCI Spambase dataset. In this dataset, each email is assigned a label of spam or not

spam. There are 4601 emails, all of which have been processed to extract a number of features,

including the frequency of certain spammy words and statistics about the use of capital letters. Before

performing any experiments, I randomized this dataset once. I used this same random ordering in

each of my trials, so as not to skew my results.
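As a minimal sketch, the one-time shuffle looks something like this in MATLAB (the file name, seed, and split ratio here are illustrative, not the exact values from my scripts):

    % Load Spambase: 4601 rows, 57 features plus a 0/1 spam label in the last column.
    data = csvread('spambase.data');
    rng(1);                               % fixed seed so the ordering is reproducible
    perm = randperm(size(data, 1));       % one random permutation, reused in every trial
    data = data(perm, :);
    nTrain = round(0.8 * size(data, 1));  % e.g., an 80/20 train/test split
    trainSet = data(1:nTrain, :);
    testSet  = data(nTrain + 1:end, :);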

To implement my neural network, I first attempted to use Caffe, a deep learning library from

UC Berkeley. However, I encountered numerous problems while attempting to get it to build on my

computer. I found Caffe to be a very poorly-documented open-source library which depended on

numerous other poorly-documented open-source libraries, all of which were themselves quite tricky

to install. It was shamefully complicated even to obtain all of the dependencies, some of which required days of waiting for an application to be approved before they could even be downloaded.

Unfortunately, in order to compile Caffe one must install the correct versions of all the required libraries and put each in the correct location in the file system, which is completely different for

each library. Once the dependencies are downloaded and installed, you must set numerous

environment variables and manually configure a makefile of several hundred lines. Each time you

make a change to any one of these pieces along the way, it takes about 30 minutes of compilation to

determine whether it fixed the problem or not.

After well over 10 hours of fierce conflict with Caffe, I decided to explore other options.

After playing with several libraries, I settled on modifying a MATLAB implementation of a feed-forward MLP with backpropagation by Hesham Eraqi. I chose this implementation as a basis for my project because it made it easy to change the network configuration, momentum alpha, number of epochs, and learning rate, each by changing a single line in its "Configurations/Parameters" section.
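To give a sense of what that looks like, here is a hypothetical version of such a settings block (these variable names are illustrative stand-ins, not Eraqi's actual identifiers):

    % Configurations/Parameters (illustrative names)
    nodesPerLayer = [57 11 1];   % 57 Spambase inputs, one hidden layer of 11 neurons, 1 output
    learningRate  = 0.1;         % step size for backpropagation updates
    momentumAlpha = 0.1;         % weight on the previous weight update
    nEpochs       = 199;         % passes over the training set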

Eraqi's MLP implementation only supported calculating training error, so I added code to evaluate the network on a testing set once it had finished training, plus additional code to calculate and display the final results. After all 10 trials have completed, the testing errors are averaged.

Both the average testing error and the average training error are displayed, because a large difference between the two is a good indicator that the network is overfitting. I also display the network configuration, learning rate, and momentum alpha.
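A simplified sketch of the evaluation logic I added is shown below (trainMLP and classificationError are hypothetical stand-ins for Eraqi's training code and my error-counting code):

    nTrials  = 10;
    trainErr = zeros(nTrials, 1);
    testErr  = zeros(nTrials, 1);
    for t = 1:nTrials
        % Train a fresh network, then measure error on both sets.
        net = trainMLP(trainSet, nodesPerLayer, learningRate, momentumAlpha, nEpochs);
        trainErr(t) = classificationError(net, trainSet);
        testErr(t)  = classificationError(net, testSet);
    end
    fprintf('Average training error: %.2f%%\n', 100 * mean(trainErr));
    fprintf('Average testing error:  %.2f%%\n', 100 * mean(testErr));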

After some initial exploratory trials, I found that a learning rate of 0.1 was ideal. Trials with several epoch counts between 200 and 2000 showed that increasing the number of epochs gave no improvement in accuracy beyond 199 epochs, so I used 199 epochs for each trial. My actual

experiments consisted of 29 trials, each with its own configuration and parameter settings.

III. Results

Figure 1 shows how all of my experiments yielded accuracies in the 92-95% range,

demonstrating that neural networks have a fairly high accuracy regardless of the configuration or

parameters used. Across my trials I tested a wide range of configurations, from a single hidden layer

of eight neurons, to two hidden layers of five neurons, to three hidden layers of 50, 50, and 200

neurons. It seems that almost any neural network will perform fairly well, no matter its setup, but through fine-tuning we can increase performance by several percentage points.

Figure 1. Classification accuracy across all 29 trials (every configuration fell in the 92-95% range).

I tested several numbers of hidden layers (Figure 2), and I tried many different sizes for each

layer. However, no matter the number of neurons per layer, a single layer proved to be ideal. Both

my lowest error and my lowest average error across trials came from networks with one layer.

Figure 2. Testing error for networks with different numbers of hidden layers.

Once I determined that one layer was sufficient, I tried several different numbers of neurons

for this layer (Figure 3). Preliminary tests showed that any number over 15 caused overfitting; however, I ran one experiment with 40 neurons just to confirm. Interestingly, 11 performed best,

outperforming both 10 and 12 by almost 0.5%. Combined with the previous experiment, this made it clear that a simple network worked best. My best results came from networks with one

hidden layer of 11 hidden neurons. Networks that were larger either vertically or horizontally often

got a very low average training error (sometimes below 0.5%) while testing error increased past 6%. I took this as a clear sign of overfitting.



Figure 3. Testing error for different numbers of neurons in a single hidden layer.

The final parameter I tried adjusting was the momentum alpha (Figure 4). This variable had

a surprisingly large effect on the error rate. I performed several experiments, holding the network

configuration and learning rate constant, and found that networks with a momentum alpha of 0.1 dramatically outperformed those with higher or lower momentum alphas.

Figure 4. Testing error for different values of the momentum alpha.

IV. Conclusions

Through my experiments, I found that the best configuration for spam detection on the UCI

Spambase dataset with a neural network is 11 hidden neurons in a single hidden layer, and a

momentum alpha of 0.1.

My results confirmed the findings of Idris, who used a neural network to classify spam on

this same dataset and attained an accuracy of 94.3%. Most of my results fell in this general range,

though after tweaking and experimentation I was able to train a network that slightly beat their best

result. Using what I found to be the ideal configuration, I attained an accuracy of 94.6%.

This goes to show that fine-tuning of network configuration and parameters is quite important in neural network research. Even though nearly all neural networks will perform quite well, adding just a single neuron can have a nontrivial effect on error rate. In my case, tiny changes like

this sometimes reduced my error rate by as much as half of a percent.

V. Future Work

Researchers pursuing this topic might consider examining the false-positive rate of

networks with different configurations. In spam detection, false positives are essentially

unacceptable, and one of the primary advantages of naïve Bayes is that it promises a low false-

positive rate. If one could develop a neural network with a very low false-positive rate, neural

networks would seem a much more viable option for commercial spam detection. It would be quite

interesting to see whether the network which yielded the lowest error rate also yielded the lowest

false-positive rate.
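As a rough sketch of the extra measurement this would involve (predictLabels is a hypothetical stand-in for running a trained network over the test set):

    labels = testSet(:, end);                         % 1 = spam, 0 = not spam
    preds  = predictLabels(net, testSet(:, 1:end-1));
    falsePos = sum(preds == 1 & labels == 0);         % legitimate mail flagged as spam
    trueNeg  = sum(preds == 0 & labels == 0);
    fpr = falsePos / (falsePos + trueNeg);            % false-positive rate
    fprintf('False-positive rate: %.3f%%\n', 100 * fpr);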

References

Graham, P. (2002). "A Plan for Spam."

Idris, I. (2014). "E-mail Spam Classification with Artificial Neural Network and Negative Selection Algorithm." International Journal of Computer Science, 1.

Massey, B., et al. (2003). "Learning Spam: Simple Techniques for Freely-Available Software." Proceedings of the FREENIX Track, 2003 USENIX Annual Technical Conference, pp. 63-76, Berkeley, CA, USA.

Metz, C. (2015, July 9). "Google Says Its AI Catches 99.9 Percent of Gmail Spam." Wired. Accessed May 12, 2016. http://www.wired.com/2015/07/google-says-ai-catches-99-9-percent-gmail-spam/all/1

Sallab, A. A., & Rashwan, M. A. (2012). "E-Mail Classification Using Deep Networks." Journal of Theoretical and Applied Information Technology, 37(2), 241-251.
