Prioritization
1 Introduction
Email is one of the most widely used forms of communication in today’s digital age. Each day
more than 205 billion emails are sent and received. Email overload is a real and growing
problem, as the statistics put forth in [1] show. It is vital that users make good use of their
time when dealing with daily email. The goal of this project is to help manage email
overload. Our system can be used to manage an incoming email stream by helping the user
prioritize which emails to respond to. We model this task as two separate
classification problems.
As described in [2], there is a wide variety of emails:
• Read-only emails
• Emails on which a person is CCed, perhaps unnecessarily
• Spam
• Emails which require a response.
To deal with this problem of email overload, we propose a solution where we first classify a
given mail as spam or ham and then further classify a 'ham' email as important or not important.
An important email is one which is deemed to require a response. A tremendous amount of
research has been done in classifying emails as spam using SVMs, KNNs, Naive Bayes and
simple neural networks. We aim to explore these approaches and extend them by experimenting
with Deep Learning Models to provide a comprehensive email management experience.
Thus, our system consists of two classification problems. We put forward a workflow containing
the best-tuned model for each problem. The models we initially experiment with are KNNs,
Random Forests and Linear SVCs. On the deep learning side, we experimented with CNNs and
LSTMs. Using the best-performing model for each problem, we achieved a net accuracy of
86.83% for our entire system.
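The two-stage workflow described above can be sketched as a simple cascade. This is only an illustration: the function names and the keyword rules standing in for the trained classifiers are hypothetical and are not the models used in this work.

```python
from typing import Callable

# Hypothetical stand-ins for the two trained classifiers: each takes the
# email text and returns True or False.
Classifier = Callable[[str], bool]


def prioritize(email_text: str, is_spam: Classifier, is_important: Classifier) -> str:
    """Route one email through the spam/ham stage, then the important/not stage."""
    if is_spam(email_text):
        return "spam"
    return "important" if is_important(email_text) else "not important"


# Toy keyword rules, used only so that the sketch runs end to end.
def toy_is_spam(text: str) -> bool:
    return "lottery" in text.lower()


def toy_is_important(text: str) -> bool:
    return "?" in text or "please reply" in text.lower()


inbox = [
    "You won the lottery, claim now",
    "Can you send me the report by Friday?",
    "FYI: the office is closed on Monday",
]
labels = [prioritize(message, toy_is_spam, toy_is_important) for message in inbox]
print(labels)  # ['spam', 'important', 'not important']
```

Only emails that pass the first stage as ham reach the second classifier, so errors made by the spam stage carry over into the overall result.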
2 Related Work
Email spam classification is one of the classical research problems in Machine Learning and a
massive amount of work has already been performed in identifying spam emails [3][4][5]. SVM,
Naive Bayes, Neural Networks and KNN are popular techniques used for this task [6].
In experiments on deep learning for spam classification [7], models averaged 87.24%
classification accuracy, while SVM and KNN approaches averaged 70% and 63.86%
respectively. These findings suggest that Deep Learning can be a useful approach to spam
detection. Hence, we experiment with and compare the performance of a number of models in
different settings for an email prioritization system. We pick a single Deep Learning model and
compare its performance with Random Forest, LinearSVC and KNN models. In addition to a
spam classifier we also build a ham classifier that prioritizes the emails as Important or Not
Important. Research has been conducted on prioritizing emails by identifying the ones
that require an action. Corston-Oliver et al. applied SVMs to a hand-collected dataset of 15k
emails [8]. Similarly, Dredze et al. worked on predicting whether an email needs an answer [9], using
logistic regression with hand-crafted features.
Using our own dataset built from emails from our Gmail inbox, we validate our choice of
models. In the end, we put forward a system that encompasses two models in order to generate a
personalized Email Prioritization System.
3 Datasets
We use independent datasets to train our models for our classification problems. While labelled
spam data is readily available, some ingenuity was needed to procure data for the
Important/Not Important classification task. We found one dataset, namely the Parakweet
dataset, and created another one.
1) Parakweet Intent dataset [10]: A hand-labeled subset of the Enron dataset, labeled for
whether an email requires action or not. An email is “important” if it requires action and “not
important” if it does not. This dataset is used independently to train, test and
validate the Ham Classifier.
2) Aditya’s Gmail Dataset: We created our own dataset by downloading email data from one of
our personal Gmail inboxes. Gmail has a system that automatically categorizes emails as
Important and places them in a separate folder called “Important”. We use this categorization
as the ground truth. We sample an equal number of emails not marked as
“Important” to make up the other class.
The two data sources above serve as independent datasets for the Important/Not
Important classification task. For the Spam/Ham classification task we use a third dataset:
3) We collected all the emails from Aditya’s Gmail Dataset as the class representing “Ham” and
sampled an equal number of emails from the Enron Corpora to make up the spam class. This
was done because Gmail does not retain spam for longer than 30 days.
It is also plausible that spam is more or less uniform across all sources.
For this classification task, we did not use any metadata associated with an email;
we used only the text contained in the subject and body of the email. The most frequent words, in
terms of raw count and TF-IDF, in the entire email corpus or in individual categories of
email are taken as features. For the Deep Learning models, features were composed of word
sequences, where each word was mapped to its corresponding rank and then to its word
embedding. A fixed word sequence length was chosen.
Every dataset was split in an 8:2 ratio: 80% as training data and the remaining 20% as test data.
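A minimal sketch of such an 8:2 split is given here. Shuffling before the split and the fixed seed are assumptions, as is the helper name split_80_20; the text does not state how the split was drawn.

```python
import random


def split_80_20(samples, labels, seed=0):
    """Shuffle paired (sample, label) data and return (train, test) in an 8:2 ratio."""
    paired = list(zip(samples, labels))
    random.Random(seed).shuffle(paired)  # assumption: the split is random
    cut = int(0.8 * len(paired))
    return paired[:cut], paired[cut:]


# Toy data standing in for a labelled email dataset.
emails = [f"email {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

train, test = split_80_20(emails, labels)
print(len(train), len(test))  # 8 2
```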
3 Methodology
Figure 1 gives an overall view of the architecture of our system:
[Figure 1: Overall architecture of the system]
Spam/Ham Classifier
To tune the performance of the Spam/Ham classifier, we extracted equal numbers of emails
from spam and inbox to make the dataset.
First Approach: Our first approach was to select the 3000 most frequent words from each class
(Spam and Inbox, or imp/not_imp) and remove the words that are common to both classes to
build the model features. The model features are thus composed of words that are unique to
either spam or ham. We assume that this makes the model more discriminative.
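The first approach can be sketched as follows. Whitespace tokenization, lowercasing, and the helper names are assumptions made for illustration; the tiny example corpora are invented.

```python
from collections import Counter


def top_words(docs, n):
    """Return the set of the n most frequent words in a list of documents."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    return {word for word, _ in counts.most_common(n)}


def discriminative_vocabulary(docs_a, docs_b, n=3000):
    """Top-n words of each class, minus the words common to both classes."""
    top_a = top_words(docs_a, n)
    top_b = top_words(docs_b, n)
    return sorted(top_a ^ top_b)  # symmetric difference drops shared words


spam_docs = ["win free money now", "free prize win"]
ham_docs = ["meeting moved to now", "please review the free draft"]

vocab = discriminative_vocabulary(spam_docs, ham_docs)
print(vocab)
```

In this toy example "free" and "now" occur in both classes and are therefore dropped, while words such as "win" and "meeting" remain as features.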
Second Approach: Our second approach involves selecting the 3000 most frequent words from
the entire dataset and taking them as features. For each item (email) in the training data, we
check whether each feature is present in the item to generate the feature values. This is one of the most
commonly used feature generation methods.
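A minimal sketch of this presence/absence encoding, again assuming whitespace tokenization and lowercasing (the helper names are hypothetical):

```python
from collections import Counter


def build_vocabulary(docs, n=3000):
    """Return the n most frequent words over the entire dataset."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    return [word for word, _ in counts.most_common(n)]


def presence_vector(doc, vocabulary):
    """1 if the vocabulary word occurs in the email, else 0."""
    words = set(doc.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


docs = ["free money now", "meeting moved to now", "free draft"]
vocabulary = build_vocabulary(docs, n=3)

# Encoding against an explicit vocabulary keeps the example easy to follow.
vector = presence_vector("free offer now", ["free", "money", "now"])
print(vector)  # [1, 0, 1]
```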
tf-idf: In the third approach, we use the count of a particular word (feature) in an email as the feature
value and then apply the tf-idf representation to weight each feature.
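One way to realize the count-then-tf-idf step is with scikit-learn's CountVectorizer followed by TfidfTransformer. The choice of these two classes and their default settings is an assumption; the text does not state which tf-idf implementation or variant was used.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Invented toy emails; "money" appears in every one of them.
emails = [
    "free money free prize",
    "money for the meeting",
    "money and draft attached",
]

# Step 1: raw word counts, limited to the most frequent words.
vectorizer = CountVectorizer(max_features=3000)
counts = vectorizer.fit_transform(emails)

# Step 2: reweight the counts so that words found in many emails count less.
tfidf = TfidfTransformer().fit_transform(counts)

vocab = vectorizer.vocabulary_
row = tfidf[1].toarray()[0]
# Both words occur once in the second email, but "meeting" is rarer overall.
print(row[vocab["meeting"]] > row[vocab["money"]])  # True
```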
Once the dataset is preprocessed and the features have been generated, we perform the
following operations for each approach:
• Find the performance of default classifiers on the test data
• Find the performance of the same classifiers after feature selection and hyperparameter
tuning on the test data
Finally, we compare the results obtained. We do this for the Parakweet, Gmail and Gmail + Enron
datasets.
Feature selection: We use the scikit-learn implementation of SelectKBest to find the K best features for
each model. SelectKBest is a filter method that scores each feature by its statistical dependency on
the class label and keeps the K highest-scoring features. We run it with different values of K for each
classifier and use the K that gives the best accuracy to generate a good subset of features,
thus removing noise and improving the performance of the individual classifiers.
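A sketch of how K might be chosen with SelectKBest. The chi-squared score function, the synthetic data, the candidate K values, and the use of cross-validated accuracy to compare them are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic count-like data: 50 features, of which only the first 5 depend
# on the class label (stand-in for a real word-count matrix).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.integers(0, 3, size=(200, 50)).astype(float)
X[:, :5] += 3 * y[:, None]

scores = {}
for k in (5, 10, 25, 50):
    # Selecting inside the pipeline keeps feature selection within each fold.
    model = make_pipeline(SelectKBest(chi2, k=k), LinearSVC(max_iter=5000))
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Placing the selector inside the pipeline means the K features are re-selected on each training fold, which avoids scoring features on data that is later used for evaluation.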
Hyperparameter Tuning: We use scikit-learn's GridSearchCV to tune the hyperparameters of
our models. For each combination of hyperparameters we perform 5-fold cross-validation to reduce
generalization error. The final range of values for each parameter was determined
experimentally. We started with a wide range of values for each parameter and then
narrowed the range by checking the accuracy of the model at different values within it,
finally selecting the sub-range where the model gave high accuracy.
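The tuning step can be sketched with GridSearchCV as follows. The C and K ranges are the ones reported in this document; the synthetic data from make_classification and the max_iter setting are assumptions used only to make the example runnable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the email feature matrix.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

c_values = [round(0.1 * i, 1) for i in range(1, 21)]  # 0.1 to 2.0 in steps of 0.1
k_values = list(range(1, 11))                         # 1 to 10 in steps of 1

searches = {
    "LinearSVC": GridSearchCV(
        LinearSVC(max_iter=5000), {"C": c_values}, cv=5, scoring="accuracy"
    ),
    "KNN": GridSearchCV(
        KNeighborsClassifier(), {"n_neighbors": k_values}, cv=5, scoring="accuracy"
    ),
}

for name, search in searches.items():
    search.fit(X, y)  # 5-fold cross-validation for every grid point
    print(name, search.best_params_, round(search.best_score_, 3))
```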
Deep Learning Model
The framework we use to experiment with Deep Learning is Keras with a TensorFlow backend.
The deep learning model that we chose to experiment with is a Convolutional Neural Network (CNN). A CNN
consists of several layers of convolutions with nonlinear activation functions such as ReLU or tanh
applied to the results. In a traditional feedforward neural network, each input neuron is connected to
each output neuron in the next layer. A CNN instead uses convolutions over the
input layer to compute the output.
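To make the contrast with a fully connected layer concrete, here is a small NumPy sketch of a single 1D convolution followed by ReLU. The input values and the two-weight filter are invented for illustration and do not come from the model used in this work.

```python
import numpy as np


def conv1d_relu(x, kernel, bias=0.0):
    """Slide the kernel over x (no padding, stride 1) and apply ReLU."""
    width = len(kernel)
    windows = [x[i:i + width] for i in range(len(x) - width + 1)]
    out = np.array([np.dot(window, kernel) + bias for window in windows])
    return np.maximum(out, 0.0)


x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
kernel = np.array([-1.0, 1.0])  # the same two weights are reused at every position

result = conv1d_relu(x, kernel)
print(result)  # [2. 0. 3. 0.]
```

The filter here has 2 weights regardless of input length, whereas a fully connected layer mapping the same 5 inputs to 4 outputs would need 20.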
Our implementation is based on [11]. A sentence is represented as the
concatenation of the word vectors corresponding to the words of the sentence, padded
with zeroes where necessary in order to have a fixed-size input. The word vectors used
are from GloVe. Specifically, we used the 100-dimensional GloVe embeddings of 400k words
computed on a 2014 dump of English Wikipedia. An overview of the approach we followed:
1) Convert all text samples in the dataset into sequences of word indices. A "word index" is
simply an integer ID for the word. We consider only the 10,000 most commonly occurring
words in the dataset, and truncate the sequences to a maximum length of 400 words.
2) Prepare an "embedding matrix" which contains at index i the embedding vector for the word of
index i in our word index. Then, load this embedding matrix into a Keras Embedding layer.
3) Build on top of it a 1D convolutional neural network, ending in a sigmoid output over 2 categories.
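Steps 1 and 2 above can be sketched in plain Python and NumPy. The helper names, whitespace tokenization, padding at the end of the sequence, and the toy vector dictionary standing in for the GloVe file are all assumptions; the Keras model of step 3 is not shown.

```python
from collections import Counter

import numpy as np

MAX_WORDS = 10_000  # vocabulary size from step 1
MAX_LEN = 400       # fixed sequence length from step 1
EMBED_DIM = 100     # dimensionality of the GloVe vectors


def build_word_index(texts, max_words=MAX_WORDS):
    """Map each of the most frequent words to its rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = counts.most_common(max_words)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_padded_sequence(text, word_index, max_len=MAX_LEN):
    """Convert text to word indices, truncate to max_len, and pad with zeros."""
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))  # assumption: pad at the end


def build_embedding_matrix(word_index, vectors, dim=EMBED_DIM):
    """Row i holds the vector of the word with index i; row 0 is the padding row."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        if word in vectors:  # words without a pretrained vector stay all-zero
            matrix[i] = vectors[word]
    return matrix


texts = ["please send the report", "the report is attached"]
word_index = build_word_index(texts)
sequences = np.array([to_padded_sequence(t, word_index) for t in texts])

# Hypothetical stand-in for vectors parsed from the GloVe file.
toy_vectors = {"report": np.ones(EMBED_DIM)}
embedding_matrix = build_embedding_matrix(word_index, toy_vectors)

print(sequences.shape, embedding_matrix.shape)  # (2, 400) (7, 100)
```

The resulting matrix has one row per word index plus the padding row, which is the shape an embedding layer initialized from pretrained weights would expect.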
• Range of C values used: 0.1 to 2.0 in steps of 0.1
• K in KNN ranged from 1 to 10 in steps of 1
• Max Features: Total_Features, Total_Features/2, Total_Features/3, Total_Features/4, log2
• Max Depth ranged from 40 to 80 in steps of 5
The models using the second approach gave about 5% higher accuracy on average than
the models using the first approach. Overall, the models that used tf-idf features outperformed the
other two approaches, with an average accuracy of 79%. Considering the individual models used,
LinearSVC performs better than KNN and Random Forest for this task.
Aditya’s Gmail Dataset
The previous classification experiments were performed on a public corpus. To see how the
classifiers fare on personal data, we run them on Aditya’s Gmail dataset. The following are the
results observed.
• Range of K values tried: 200, 400, 800, 1000, 1200, 1500, 1800, 2000, 2200, 2536, 3000
• Range of C values used: 0.1 to 2.0 in steps of 0.1
• K in KNN ranged from 1 to 10 in steps of 1
• Max Features: Total_Features, Total_Features/2, Total_Features/3, Total_Features/4
• Max Depth ranged from 40 to 80 in steps of 5
The performance remains unchanged after feature selection because the best value of K in
SelectKBest was the same as the initial number of features. In the second approach, the performance
also remains unchanged after feature selection. Further, after hyperparameter tuning, the performance of
KNN and SVC is found to decrease. This is because, during hyperparameter selection, we
perform 5-fold cross-validation to decrease generalization error. The parameter values obtained
differ from the default parameters and are chosen to perform well on unseen data.
Spam Classifier
• Range of C values used: 0.1 to 2.0 in steps of 0.1
• K in KNN ranged from 1 to 10 in steps of 1
• Max Features: 50 to 200 in steps of 50
• Max Depth ranged from 40 to 80 in steps of 5
All the models perform relatively well on spam classification, with LinearSVC using
tf-idf giving the best performance at 97.64%. The CNN gives the following results on 3 datasets:
References
[1] Jackson, Tom, and Emma Russell. "Four email problems that even titans of tech haven't resolved." The
Conversation (2015).
[2] T. Jackson. What to do with Employees that are too busy to manage their Email? Available online at
http://www.proni.gov.uk/what to do with employees that are too busy to manage their email tom jackson.pdf
[3] Youn, Seongwook, and Dennis McLeod. "A comparative study for email classification." Advances and
innovations in systems, computing sciences and software engineering. Springer Netherlands, 2007. 387-391.
[4] Awad, W. A., and S. M. ELseuofi. "Machine Learning methods for E-mail Classification." International
Journal of Computer Applications 16.1 (2011).
[5] Tretyakov, Konstantin. "Machine learning techniques in spam filtering." Data Mining Problem-oriented
Seminar, MTAT. Vol. 3. No. 177. 2004.
[6] Guzella, Thiago S., and Walmir M. Caminhas. "A review of machine learning approaches to spam filtering."
Expert Systems with Applications 36.7 (2009): 10206-10222.
[7] http://www.jatit.org/volumes/Vol37No2/12Vol37No2.pdf
[8] Corston-Oliver, Simon H., et al. "Integration of email and task lists." (2004).
[9] Dredze, Mark, John Blitzer, and Fernando Pereira. "Reply Expectation Prediction for Email Management."
CEAS. 2005.
[10] https://github.com/ParakweetLabs/EmailIntentDataSet
[11] Y. Kim. Convolutional Neural Networks for Sentence Classification. New York University