CS 539 Introduction to Artificial Neural Networks and Fuzzy Systems, Spring 2016
Abstract
This is my final project for CS 539. In this project, I demonstrate the suitability of
neural networks for the task of classifying spam emails. I discuss how I was able to attain a
classification accuracy of 94.6% through minor changes in network configuration and the
dataset.
I. Introduction
Neural networks are powerful tools for any machine learning task that involves
classification. They are utilized in a wide range of applications, including recommendation engines,
computer vision, and dashboard customization. Because of their versatility, they are emerging as one
of the most widely used techniques in machine learning.
However, neural networks are not as widely used in spam email classification as one might
expect. Instead, most modern spam filters employ naive Bayes classifiers, due in large part to Paul
Graham's famed article "A Plan for Spam." Naive Bayes is a great approach for spam classification,
with high accuracy and a low false-positive rate, but by itself it may not be enough to achieve the
near-perfect accuracy that commercial spam filters demand.
Google reported that introducing neural networks into Gmail's spam filters took them from
99.5% to 99.9% accuracy, suggesting that neural networks may be useful for enhancing spam filters,
especially when used in conjunction with Bayesian classification and other methodologies. However,
there is not much research on the use of neural networks for spam detection, and most of the
existing research holds the network configuration, momentum, and learning rate as a constant,
investigating the effectiveness of the network across datasets rather than the suitability of different
network configurations for the task. In my project, I have done the opposite, holding the dataset
constant while adjusting the network configuration and parameters, in order to find the ideal
settings for the task.
II. Methods
Because I wanted to focus on network configuration rather than the dataset preparation, I
chose to use the UCI Spambase dataset. In this dataset, each email is assigned a label of spam or not
spam. There are 4601 emails, all of which have been processed to extract a number of features,
including the frequency of certain spammy words and the number of capital letters used. Before
performing any experiments, I randomized this dataset once, and I used this same random ordering in
every trial.
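A quick sketch of this preparation step, in Python rather than Matlab (the file name `spambase.data` is a hypothetical local path; the real dataset is a comma-separated file of 57 numeric features plus a 0/1 label per email):

```python
import csv
import random

def shuffle_once(rows, seed=539):
    """Return a copy of rows in a fixed pseudo-random order.

    A fixed seed makes the shuffle reproducible, so every trial
    can reuse the exact same random ordering of the dataset."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled

def load_spambase(path):
    """Parse the Spambase CSV: 57 numeric features plus a 0/1 spam label."""
    with open(path) as f:
        return [[float(x) for x in row] for row in csv.reader(f)]

# rows = shuffle_once(load_spambase("spambase.data"))  # hypothetical path
```

Because the seed is fixed, calling `shuffle_once` twice on the same data yields the identical ordering, which is what keeps the separate trials comparable.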
To implement my neural network, I first attempted to use Caffe, a deep learning library from
Berkeley which depends on numerous other poorly-documented open-source libraries, all of which
were themselves quite tricky to install. It was shamefully complicated to even obtain all of the
dependencies, some of which required days of waiting for an application to be approved before they
could even be downloaded. Unfortunately, in order to compile Caffe one must install all the correct
versions of all the correct libraries and put them in the correct location in the file system, which is
completely different for each library. Once the dependencies are downloaded and installed, you must
set numerous environment variables and manually configure a makefile of several hundred lines.
Each time you make a change to any one of these pieces along the way, it takes about 30 minutes of
compilation to see whether the change worked.
After well over 10 hours of fierce conflict with Caffe, I decided to explore other options.
After playing with several libraries, I settled on modifying a Matlab implementation of a feed-
forward MLP with backpropagation by Hesham Eraqi. I chose this implementation as a basis for my
project because it made it easy to change the network configuration, momentum alpha, number of
epochs, and learning rate, all by changing a single line in the Configurations/Parameters section.
Eraqi's MLP implementation only supported calculation of training error, so I added code to
evaluate the network with a testing set once it had finished training. I added additional code to
calculate and display final results. After all 10 trials have completed, the testing errors are averaged.
The average testing error and the average training error are both displayed, because a large
difference between the two is a good indicator that the network is overfitting. I also display the
network configuration, learning rate, and momentum alpha.
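The reporting step can be sketched as follows (again in Python rather than Matlab; the 2% gap threshold is my own illustrative choice, not a value taken from the experiments):

```python
def summarize_trials(train_errors, test_errors, gap_threshold=0.02):
    """Average the per-trial errors and flag possible overfitting
    when the average testing error exceeds the average training
    error by more than gap_threshold."""
    avg_train = sum(train_errors) / len(train_errors)
    avg_test = sum(test_errors) / len(test_errors)
    overfitting = (avg_test - avg_train) > gap_threshold
    return avg_train, avg_test, overfitting
```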
After some initial exploratory trials, I found that a learning rate of 0.1 was ideal. Trials with
several epoch counts between 200 and 2000 showed that increasing the number of epochs gave no
improvement in accuracy beyond 199 epochs, so I used 199 epochs for each trial. My actual
experiments consisted of 29 trials, each with its own configuration and parameter settings.
III. Results
Figure 1 shows how all of my experiments yielded accuracies in the 92-95% range,
demonstrating that neural networks have a fairly high accuracy regardless of the configuration or
parameters used. Across my trials I tested a wide range of configurations, from a single hidden layer
of eight neurons, to two hidden layers of five neurons, to three hidden layers of 50, 50, and 200
neurons. It seems that any neural network will perform fairly well, no matter its set-up, but through
careful tuning its accuracy can be pushed noticeably higher.
Figure 1
I tested several numbers of hidden layers (Figure 2), and I tried many different sizes for each
layer. However, no matter the number of neurons per layer, a single layer proved to be ideal. Both
my lowest error and my lowest average error across trials came from networks with one layer.
Figure 2
Once I determined that one layer was sufficient, I tried several different numbers of neurons
for this layer (Figure 3). Preliminary tests showed that any number over 15 caused overfitting,
though I ran one experiment with 40 neurons just to confirm. Interestingly, 11 performed best,
outperforming both 10 and 12 by almost 0.5%. Combined with the previous experiment, it became
clear that a simple network always worked best. My best results came from networks with one
hidden layer of 11 hidden neurons. Networks that were larger, either vertically or horizontally, often
achieved a very low average training error, sometimes below 0.5%, while their testing error climbed,
a clear sign of overfitting.
Figure 3
The final parameter I tried adjusting was the momentum alpha (Figure 4). This variable had
a surprisingly large effect on the error rate. I performed several experiments, holding the network
configuration and learning rate constant, and found that networks with a momentum alpha of 0.1
dramatically outperformed those with momentum alphas that were higher or lower.
Figure 4
IV. Conclusions
Through my experiments, I found that the best configuration for spam detection on the UCI
Spambase dataset with a neural network is 11 hidden neurons in a single hidden layer, with a
learning rate of 0.1 and a momentum alpha of 0.1.
My results confirmed the findings of Idris, who used a neural network to classify spam on
this same dataset and attained an accuracy of 94.3%. Most of my results fell in this general range,
though after tweaking and experimentation I was able to train a network that slightly beat their best
result. Using what I found to be the ideal configuration, I attained an accuracy of 94.6%.
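As an illustration of that final set-up, here is a rough Python/numpy sketch of a one-hidden-layer MLP trained with backpropagation and momentum. This is not Eraqi's Matlab code; the weight initialization and squared-error loss are assumptions for the sketch, and only the headline parameters (11 hidden neurons, learning rate 0.1, momentum alpha 0.1, 199 epochs) come from my experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, y, hidden=11, lr=0.1, alpha=0.1, epochs=199, seed=0):
    """Train a one-hidden-layer MLP by full-batch gradient descent
    with momentum (alpha); returns a prediction function."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.5, size=(d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1)); b2 = np.zeros(1)
    vW1 = np.zeros_like(W1); vb1 = np.zeros_like(b1)
    vW2 = np.zeros_like(W2); vb2 = np.zeros_like(b2)
    y = y.reshape(-1, 1)
    for _ in range(epochs):
        # forward pass
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # backpropagation of squared error through the sigmoids
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        gW2 = h.T @ d_out / n; gb2 = d_out.mean(axis=0)
        gW1 = X.T @ d_h / n;   gb1 = d_h.mean(axis=0)
        # momentum update: each step keeps a fraction of the last one
        vW2 = alpha * vW2 - lr * gW2; W2 += vW2
        vb2 = alpha * vb2 - lr * gb2; b2 += vb2
        vW1 = alpha * vW1 - lr * gW1; W1 += vW1
        vb1 = alpha * vb1 - lr * gb1; b1 += vb1
    def predict(Xnew):
        hid = sigmoid(Xnew @ W1 + b1)
        return (sigmoid(hid @ W2 + b2) > 0.5).astype(int).ravel()
    return predict
```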
This goes to show that fine-tuning of network configuration and parameters is quite
important in neural network research. Even though all neural networks will perform quite well,
adding just a single neuron can have a nontrivial effect on error rate. In my case, tiny changes like
these were the difference between matching prior results and beating them.
V. Future Work
Further researchers on this topic might consider looking at the false-positive rate of
networks with different configurations. In spam detection, false positives are essentially
unacceptable, and one of the primary advantages of naive Bayes is that it promises a low false-
positive rate. If one could develop a neural network with a very low false-positive rate, neural
networks would seem a much more viable option for commercial spam detection. It would be quite
interesting to see whether the network which yielded the lowest error rate also yielded the lowest
false-positive rate.
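The false-positive rate itself is cheap to compute alongside overall accuracy; a small sketch (using Spambase's convention of label 1 for spam, 0 for legitimate email):

```python
def false_positive_rate(y_true, y_pred):
    """Fraction of legitimate emails (label 0) wrongly flagged as spam.

    In spam filtering this matters more than raw accuracy, since losing
    a legitimate email is far worse than letting one spam through."""
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    negatives = sum(1 for t in y_true if t == 0)
    return false_pos / negatives if negatives else 0.0
```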
References
Idris, I. (2014). E-mail Spam Classification with Artificial Neural Network and Negative Selection
Massey, B., et al. (2003). Learning Spam: Simple Techniques for Freely-Available Software. Proceedings of
the FREENIX Track, 2003 USENIX Annual Technical Conference, June 9, 2003, pp. 63-76.
Metz, Cade. "Google Says Its AI Catches 99.9 Percent of Gmail Spam." Wired.com. July 09, 2015.
percent-gmail-spam/all/1.
Sallab, A. A., & Rashwan, M. A. (2012). E-Mail Classification Using Deep Networks. Journal of