
Spam Email Classifier

With the addition of POS tags

CSE 390
By: Erik Emmanuel

Brian Lin

Jianneng (Jack) Wu

05/04/2017

Professor Niranjan Balasubramanian

TA: Noah Weber


Spam Email Classifier

Natural Language Processing gives computers the ability to extract the meaning behind our language, and it provides the tools we needed to build this spam email classifier. The classifier is designed to filter emails by removing those that were reported as being sent automatically by another computer. We chose this topic for our project because it meant building something that could save people a lot of time. Although our project is limited to a small dataset, it lets us understand how such filtering can be done. The project consists of three parts. In Part 1 we build the training data needed to make the classifier. In Part 2 we build the Naive Bayes algorithm, test it on a new set of data, and measure the classifier's performance with precision, recall, and F-1 score. The last part is to implement part-of-speech tagging to further improve the performance of the classifier.


Probability generation for training dataset (Part 1: Erik)

Vocab set generation:

For the data extraction we used the data we acquired from the Enron dataset. From the dataset we used the preprocessed data, which was already categorized as either spam or not spam (ham). The training set contained 5975 unique emails, with a 1:3 ratio of spam to ham emails. For the probability-generation part of the project we wrote a Python script that walks through the dataset folder, opens each email, and tokenizes every line, counting how many times each word occurred in each class (spam or ham). Every new word observed was added to the vocab array. Since the dataset was large, I omitted tokens that were not words; e.g. 12 was not considered because it is a number, while NBA16 is an example of an alphanumeric token that was kept as part of the vocab. I also omitted words that occurred fewer than two times. Once the vocab was finalized, I had to attach to each word the probabilities that it could lead to the document being classified as a spam or ham email.
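A minimal sketch of this counting step, assuming the enron1 folder layout with spam/ and ham/ subfolders (the variable names and the exact tokenization rule are our own simplifications, not the project script itself):

```python
import os
import re
from collections import Counter

def build_counts(folder, counts):
    """Tokenize every email in a folder and update per-word counts for that class."""
    for name in os.listdir(folder):
        with open(os.path.join(folder, name), errors="ignore") as f:
            for line in f:
                # keep alphanumeric tokens that start with a letter (e.g. "nba16"), drop pure numbers
                for token in re.findall(r"[a-z][a-z0-9]*", line.lower()):
                    counts[token] += 1

spam_counts, ham_counts = Counter(), Counter()
build_counts("enron1/spam", spam_counts)
build_counts("enron1/ham", ham_counts)

# vocab: only words that occurred at least twice overall
vocab = {w for w in set(spam_counts) | set(ham_counts)
         if spam_counts[w] + ham_counts[w] >= 2}
```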

Probability Extraction:

Once the vocab array was finalized, we had to go through each word in the list and calculate the probability that the word leads to spam or to ham, i.e. determine P(spam|word) and P(ham|word):


$$P(w_i \mid \text{spam}) = \frac{\#(w_i \text{ occurred in spam})}{\#(w_i \text{ occurred in total})} \qquad\qquad P(w_i \mid \text{ham}) = \frac{\#(w_i \text{ occurred in ham})}{\#(w_i \text{ occurred in total})}$$
After finding these probabilities for each word in the vocabulary, we output the data in a readable format to an out.txt file, which is then passed on to the algorithm that predicts the test set. The following diagram can be used as a visual tool to trace the flow of the program.
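As a rough sketch of that output step, reusing the counts built in the previous sketch (the tab-separated out.txt layout here is an assumption, not the project's actual file format):

```python
def write_probabilities(vocab, spam_counts, ham_counts, path="out.txt"):
    """For each vocab word, write P(w|spam) and P(w|ham) as defined above."""
    with open(path, "w") as out:
        for w in sorted(vocab):
            total = spam_counts[w] + ham_counts[w]
            out.write(f"{w}\t{spam_counts[w] / total:.6f}\t{ham_counts[w] / total:.6f}\n")

write_probabilities(vocab, spam_counts, ham_counts)
```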
Implementation of Naive Bayes Algorithm (Part 2: Brian)

At this point we had our training data from the training emails in the form of out.txt. Our next task was to create a process that takes a new, unknown email and determines from the training data whether the email is valid or spam. This is where the Naive Bayes algorithm comes into play.

The Naive Bayes algorithm takes a list of words as its input and outputs a value that indicates whether that list of words is more likely to appear in a spam or a valid email.

$$\log \frac{P(S \mid E)}{P(H \mid E)} = \log \frac{P(S)}{P(H)} + \sum_{i=1}^{n} \log \frac{P(w_i \mid S)}{P(w_i \mid H)}$$

The left side of the equation is the value we are computing. The first term on the right is a constant that depends on the composition of the training data we use. P(S) is determined by taking the number of spam and valid emails and computing the probability that a randomly picked email is spam. In our case we had 1500 spam emails and 3672 valid emails, so P(S) = 1500/(3672 + 1500). The same applies to P(H), which denotes the probability that the picked email is a valid email. The summation of log probabilities takes each word, divides the probability that the word is in spam by the probability that the word is in a valid email, and takes the log of the quotient. The log is important because it determines whether the term is added or subtracted: any ratio between 0 and 1 has a negative log, whereas any ratio above 1 has a positive log. This means that whenever the probability of a word being in spam is greater than the probability of it being in a valid email, log(P(w|S)/P(w|H)) is positive. An email is classified as spam only when the final value of the right side is above 0.
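A minimal sketch of this decision rule, assuming the per-word probabilities from out.txt have been loaded into two Python dicts (the names below are illustrative, not the actual code in WordProb.py):

```python
import math

def classify_email(words, p_w_spam, p_w_ham, n_spam=1500, n_ham=3672):
    """Return True (spam) when the log-ratio on the right-hand side is above 0."""
    p_s = n_spam / (n_spam + n_ham)   # probability a training email is spam
    p_h = n_ham / (n_spam + n_ham)    # probability a training email is ham
    score = math.log(p_s / p_h)
    for w in words:
        # skip words with no usable probability in either class (no smoothing is applied)
        if p_w_spam.get(w, 0) > 0 and p_w_ham.get(w, 0) > 0:
            score += math.log(p_w_spam[w] / p_w_ham[w])
    return score > 0
```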

We tested this algorithm on the test set called enron2, which contained 4361 valid emails and 1496 spam emails. Running the implemented algorithm over these emails gave us the following results:

Spam: Recall = 0.4525 Precision = 0.9713 F-1 = 0.6174

Ham: Recall = 0.8167 Precision = 0.7737 F-1 = 0.7947

The precision of the classifier on spam emails was excellent: almost every email that the classifier predicted as spam actually turned out to be spam. However, the recall on spam emails was very poor; less than half of the emails that were spam were detected by the classifier. I believe the reason is the training data. Because the number of spam emails is much smaller than the number of valid emails in the training set, many words end up with a higher probability of appearing in a valid email than in a spam email. This affects the equation, more precisely the ratio P(w|S)/P(w|H). That fraction is then a value between 0 and 1, and the log of such a value is negative. Adding negative terms skews the sum toward classifying the email as valid, and that is why the recall suffers. This could be fixed if we had more training data in which the ratio of spam to valid emails were closer to 1.


Figure 2.1: Flowchart of the algorithm implementation
POS Tagging and the Naive Bayes Algorithm (Part 3: Jack)

Training:

Using the Enron1 data from the Enron dataset [1], I tagged the data with the Stanford POS tagger [2]. I chose it over the tagger we built ourselves for two reasons: it is much more accurate (about 97% precision) and it is much more efficient. With so much training data, tagging the data efficiently is quite important. Passing the paths of the data to pos_tagger.java outputs the results to a directory named enron_tagged. Then, using train_tagger.py and the path to the tagged ham/spam folders, we can calculate the emission probabilities P(T|W). These are output as emissions_ham.txt and emissions_spam.txt.
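A sketch of what train_tagger.py computes, assuming the Stanford tagger's default word_TAG output format; the "word tag probability" output layout and all names here are our own assumptions, not the project's actual files:

```python
import os
from collections import Counter, defaultdict

def emission_probs(tagged_dir):
    """Estimate P(tag | word) from tagged emails written as word_TAG tokens."""
    pair_counts = defaultdict(Counter)  # word -> Counter of tags seen with that word
    for name in os.listdir(tagged_dir):
        with open(os.path.join(tagged_dir, name), errors="ignore") as f:
            for token in f.read().split():
                if "_" in token:
                    word, tag = token.rsplit("_", 1)
                    pair_counts[word.lower()][tag] += 1
    return {w: {t: c / sum(tags.values()) for t, c in tags.items()}
            for w, tags in pair_counts.items()}

# Example: write the spam emissions, one "word tag probability" triple per line
probs = emission_probs("enron_tagged/spam")
with open("emissions_spam.txt", "w") as out:
    for w, tags in sorted(probs.items()):
        for t, p in tags.items():
            out.write(f"{w} {t} {p:.6f}\n")
```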

Modifying the algorithm:

Given the existing algorithm from a research paper [3] by Tianhao Sun, we implemented it as follows:

$$\log \frac{P(S \mid E)}{P(H \mid E)} = \log \frac{P(S)}{P(H)} + \sum_{i=1}^{n} \log \frac{P(w_i \mid S)}{P(w_i \mid H)}$$
Where if the result is greater than zero, then we classify the email as spam. For the POS adjustment I will be integrating interpolation into the function for one main reason: I generally understand the concept and feel it is a good and simple probability distribution for combining two features. I will be modifying the equation as follows:

$$P(w_i \mid S, t_i) = \frac{\lambda_1 P(w_i \mid S) + \lambda_2 P(w_i \mid S, t_i)}{P(S)} \qquad \text{and} \qquad P(w_i \mid H, t_i) = \frac{\lambda_1 P(w_i \mid H) + \lambda_2 P(w_i \mid H, t_i)}{P(H)}$$
Where λ1 and λ2 are weights assigned to each feature. For our specific case we used λ1 = 0.9 and λ2 = 0.1 because these weights generally provided the best results. The final equation for determining whether an email is classified as spam is then as follows:

$$\log \frac{P(S \mid E)}{P(H \mid E)} = \log \frac{P(S)}{P(H)} + \sum_{i=1}^{n} \log \frac{P(w_i \mid S, t_i)}{P(w_i \mid H, t_i)}$$

*** This is the point where I am not sure whether I normalized correctly, as there are few to no research papers detailing the addition of POS tags to the Naive Bayes algorithm.
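A minimal sketch of this modified scoring, assuming the word probabilities from out.txt and the (word, tag) emissions have been loaded into Python dicts; the function and parameter names are our own, not the names used in WordProbPOS.py, and the normalization by P(S) and P(H) simply follows the equations above:

```python
import math

def pos_score(tagged_words, p_w_spam, p_w_ham, p_t_spam, p_t_ham,
              n_spam=1500, n_ham=3672, lam1=0.9, lam2=0.1):
    """Log-ratio score with POS interpolation, following the modified equation above.

    tagged_words       : list of (word, tag) pairs for one email.
    p_w_spam / p_w_ham : per-word probabilities loaded from out.txt.
    p_t_spam / p_t_ham : per-(word, tag) probabilities from the emissions files.
    """
    p_s = n_spam / (n_spam + n_ham)
    p_h = n_ham / (n_spam + n_ham)
    score = math.log(p_s / p_h)
    for word, tag in tagged_words:
        spam_part = lam1 * p_w_spam.get(word, 0.0) + lam2 * p_t_spam.get((word, tag), 0.0)
        ham_part = lam1 * p_w_ham.get(word, 0.0) + lam2 * p_t_ham.get((word, tag), 0.0)
        if spam_part > 0 and ham_part > 0:  # skip unseen words (no smoothing is applied)
            score += math.log((spam_part / p_s) / (ham_part / p_h))
    return score  # greater than zero means the email is classified as spam
```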

Testing:

To test the algorithm, I used the Stanford tagger to tag the next set of the Enron dataset (enron2). I then modified Brian Lin's original code (WordProb.py) to handle the changed input data along with the additional emissions data, creating WordProbPOS.py. After optimizing the code, I ran the algorithm a few times to test which sets of weights would be appropriate and settled on 0.9 and 0.1 respectively.

Results and Concerns With the POS Method:

From the results (specified below), we can see that with the addition of the POS tags we have decreased the spam precision of the algorithm by around 5 percentage points with respect to the original algorithm, but have increased the spam recall by an astounding 38 percentage points (from about 0.45 to 0.83). This is where my concerns lie: I'm not 100% certain that I normalized the probabilities correctly, and I'm also not certain that the original algorithm was accurate. Both of these could have skewed the results, leading to an overestimate of the benefit of the POS addition.

Results (with and without POS):

Spam (without POS):
Precision = number of times right / number of times predicted = 677/697 = 0.971305595
Recall = number of times right / number of spam in the set = 677/1496 = 0.452540107
F-1 = 2(p*r)/(p+r) = 0.61741906061

Ham (without POS):
Precision = number of times right / number of times predicted = 2865/4361 = 0.773796791
Recall = number of times right / number of ham in the set = 3562/4361 = 0.816785141
F-1 = 2(p*r)/(p+r) = 0.794710047093

Spam (with POS):
Precision = number of times right / number of times predicted = 1246/1345 = 0.926394052
Recall = number of times right / number of spam in the set = 1246/1496 = 0.832887701
F-1 = 2(p*r)/(p+r) = 0.877155931

Ham (with POS):
Precision = number of times right / number of times predicted = 4261/4505 = 0.9458379578
Recall = number of times right / number of ham in the set = 4017/4361 = 0.9211190094
F-1 = 2(p*r)/(p+r) = 0.9333148412
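As a sanity check, a small Python snippet can reproduce these figures directly from the counts above (the helper below is our own, not part of the project code):

```python
def metrics(num_right, num_predicted, num_in_class):
    """Precision, recall, and F-1 from raw counts."""
    precision = num_right / num_predicted
    recall = num_right / num_in_class
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# Spam without POS: 697 emails predicted as spam, 677 of them correct, 1496 spam in the test set
print(metrics(677, 697, 1496))    # ~(0.9713, 0.4525, 0.6174)

# Spam with POS: 1345 predicted as spam, 1246 correct
print(metrics(1246, 1345, 1496))  # ~(0.9264, 0.8329, 0.8772)
```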
General Flow of Classifier:
Conclusion:

The motivation for this project was to detect spam emails given a set of unclassified emails. To accomplish this goal we used the Naive Bayes model to write the algorithm. We were able to obtain fairly large training and test sets from csmining.org. Using these, our Naive Bayes model was able to precisely predict the class of a given email, but our initial model was not very good at recalling the correct amount of spam emails. So we set out to add other mechanisms to the Bayes model to improve our initial algorithm, and we looked toward incorporating POS tags into the formula. By incorporating the POS tags and tuning the weights we were able to reach a happy medium where both recall and precision were fairly high. However, we are a little suspicious of the improvement we achieved with the POS tags because of our uncertainty about whether we normalized them properly. In the future, we can look toward adding smoothing techniques to account for the broader variety of words that can appear in any given email regardless of the class it may be associated with. Another direction is to prove that the normalizations we made in the POS classifier were correct, giving us more confidence in the system we created using POS tags, which gave us better recall. We can then focus on adding more metadata probabilities to the current formula to improve the classifier's ability to detect spam even more easily.
Bibliography

[1] Enron Dataset: http://csmining.org/index.php/enron-spam-datasets.html

[2] Stanford's POS Tagger: https://nlp.stanford.edu/software/tagger.shtml

[3] Tianhao Sun's paper on the Naive Bayes Algorithm: http://www.cs.ubbcluj.ro/~gabis/DocDiplome/Bayesian/000539771r.pdf
