
Managing Email Overload Through Prioritization

Aditya Arcot Srinivasan
Department of Computer Science
UMass Amherst
asrinivasan@cs.umass.edu

Vidul Ayankulangara Panickan
Department of Computer Science
UMass Amherst
vayakulangar@cs.umass.edu

1 Introduction
Emails are one of the most widely used forms of communication in today’s digital age. Each day
more than 205 billion emails are sent and received. Email overload is a real and increasing
problem with overwhelming statistics to prove it as put forth in [1]. It is vital that a user
optimizes time while dealing with daily emails. The goal of this project is to manage email
overload: our system can be used to manage an incoming email stream by helping the user
prioritize which emails to respond to. This paradigm is modelled as two separate
classification problems.
As described in [2], there is a wide range and variety of emails:
• Read Only emails
• Emails in which a person is CCed, perhaps unnecessarily
• Spam
• Emails which require a response.

To deal with this problem of email overload, we propose a solution where we first classify a
given mail as spam or ham and then further classify a 'ham' email as important or not important.
An important email is one which is deemed to require a response. A tremendous amount of
research has been done in classifying emails as spam using SVMs, KNNs, Naive Bayes and
simple neural networks. We aim to explore these approaches and extend them by experimenting
with Deep Learning Models to provide a comprehensive email management experience.

Classifying an email categorized as 'ham' into subcategories based on importance is inherently a
difficult classification problem because of its subjective nature. We have collected and curated a
dataset of email data tagged with importance labels from our personal Gmail inboxes to provide
the training data for the given approach.

Thus, our system consists of two classification problems. We put forward a workflow containing
the best tuned model for both problems. The models we initially experimented with are KNNs,
Random Forests and Linear SVCs. On the deep learning side, we experimented with CNNs and
LSTMs. Using the best performing model for both systems we achieved a net accuracy of
86.83% for our entire system.

2 Related Work
Email spam classification is one of the classical research problems in Machine Learning and a
massive amount of work has already been performed in identifying spam emails [3][4][5]. SVM,
Naive Bayes, Neural Networks and KNN are popular techniques used for this task [6].
In experiments on deep learning for spam classification [7], models averaged 87.24%
classification accuracy, while SVM and KNN approaches averaged 70% and 63.86%
respectively. These findings suggest that Deep Learning can be a useful approach to spam
detection. Hence, we intend to experiment and compare performances of a number of models in
different settings for an email prioritization system. We pick a single Deep Learning model and
compare its performance with Random Forest, LinearSVC and KNN models. In addition to a
spam classifier we also build a ham classifier that prioritizes the emails as Important or Not
Important. There has been research conducted on prioritizing emails by categorizing the ones
requiring an action. Corston-Oliver et al. applied SVMs to a hand-collected dataset of 15k
emails [8]. Similarly, Dredze has worked on predicting if an email needs an answer [9], using
logistic regression with hand-crafted features.
Using our own dataset built from emails from our Gmail inbox, we validate our choice of
models. In the end, we put forward a system that encompasses two models in order to generate a
personalized Email Prioritization System.

3 Datasets
We use independent datasets to train our models for our classification problems. While labelled
spam data is readily available, some innovation was needed to procure data for the
Important/Not Important classification task. We found one dataset, namely the Parakweet
dataset, and created another one.
1) Parakweet Intent dataset [10]: a hand-labeled subset of the Enron dataset labeled for
whether an email requires action or not. An email is “important” if it requires action and “not
important” if it doesn’t require an action. This dataset is used independently to train, test and
validate the Ham Classifier.
2) Aditya’s Gmail Dataset: We created our own dataset by downloading email data from one of
our personal Gmail Inboxes. Gmail has a system that automatically categorizes emails as
Important. We use this data as the ground truth, since Gmail places all important emails in a
separate folder called “Important”. We sample an equal number of emails not marked as
“Important” in order to make up the other class.
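As an illustration, such a dataset could be assembled roughly as follows from a Gmail Takeout mbox export, assuming the export carries the X-Gmail-Labels header; the file name, the class balancing step and the flattening of multipart messages are illustrative choices rather than the exact procedure used.

```python
import mailbox
import random

def load_gmail_dataset(mbox_path="gmail_export.mbox", seed=0):
    """Split a Gmail Takeout mbox export into Important / Not Important email texts."""
    important, other = [], []
    for msg in mailbox.mbox(mbox_path):
        labels = msg.get("X-Gmail-Labels", "")          # Takeout keeps Gmail labels here (assumed)
        payload = msg.get_payload(decode=True) or b""   # multipart bodies are simplified away
        text = (msg.get("Subject", "") or "") + " " + payload.decode("utf-8", "ignore")
        (important if "Important" in labels else other).append(text)
    # Balance the classes: sample as many "not important" emails as there are important ones
    random.seed(seed)
    other = random.sample(other, min(len(important), len(other)))
    return important, other
```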
The two data sources mentioned above serve as independent datasets for the Important/Not
Important classification task. For the Spam/Ham classification task we use a third dataset:
3) We collected all the emails from Aditya’s Gmail Dataset as the class representing “Ham” and
sampled an equal number of emails from the Enron Corpora to make up the spam class. This
was done because spam information is not contained in Gmail inboxes for longer than 30 days.
It is also conceivable that Spam data is more or less uniform across all sources.
For this classification task, we did not use any metadata information associated with an email
apart from the text contained in the subject and body of the email. The most frequent words in
terms of ordinary count and TFIDF in the entire email corpus/from individual categories of
email are taken as features. For the Deep Learning models, features were composed of word
sequences, where each word was mapped to its frequency rank and then to a word
embedding. A fixed word sequence length was chosen.
Every dataset was split in an 80:20 ratio: 80% for training and the remaining 20% as test data.
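For reference, an 80/20 split of this kind can be produced with scikit-learn's train_test_split; the toy feature matrix and labels below are only placeholders.

```python
from sklearn.model_selection import train_test_split

# Toy feature vectors and labels (1 = spam/important, 0 = ham/not important)
X = [[0, 1], [1, 0], [1, 1], [0, 0], [1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]
y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# 80% training data, 20% held-out test data, preserving class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```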
4 Methodology
Figure 1 gives an overall view of the architecture of our system:

Our Prioritization system is composed of two individual models:


• A spam/ham classifier
• A ham management (Important/Not Important) classifier
We tune the two models separately and then find the performance of the system as a whole. The
spam/ham classifier is run on a dataset formed by combining the Gmail ham dataset and a
public spam dataset, whereas the ham management classifier is run on two different datasets: a
public dataset and a personal Gmail dataset. For each classifier, we tune three different ML
models (KNN, Random Forest and Linear SVC) and compare their performance. The decision
functions for K Nearest Neighbors and Linear SVC can be written as:

KNN: $\hat{y}(x) = \arg\max_{c} \sum_{x_i \in N_k(x)} \mathbb{1}(y_i = c)$, where $N_k(x)$ is the set of the $k$ training points nearest to $x$.

LinearSVC: $f(x) = \operatorname{sign}(w^\top x + b)$, where $w$ and $b$ are learned by minimizing the hinge loss with an L2 penalty on $w$.
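A minimal sketch of how the three baseline classifiers can be instantiated in scikit-learn; the constructor arguments shown here are generic defaults rather than the tuned values reported in the experiments, and X_train/X_test stand for the feature matrices from the 80:20 split.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# Baseline models compared for both classification problems (illustrative settings)
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),        # majority vote over k neighbours
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "LinearSVC": LinearSVC(C=1.0),                     # decision rule sign(w.x + b)
}

for name, clf in models.items():
    clf.fit(X_train, y_train)                          # X_train, y_train: training split
    print(name, clf.score(X_test, y_test))             # accuracy on the held-out 20%
```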
Spam/Ham Classifier
For tuning the performance of the Spam/Ham classifier, we extracted an equal number of mails
from spam and from the inbox to make the dataset.
First Approach: Our first approach was to select the most frequent 3000 words from each class
(Spam and Inbox or imp/not_imp) and remove the words that are common to both classes to
build the model features. Thus the model features are composed of words unique to each
class. By doing this, we assume that our model will be more discriminative.
Second Approach: Our second approach involves selecting the 3000 most frequent words from
the entire dataset and taking them as features. For each item (email) in the training data, we
check whether each feature appears in the item to generate binary feature values. This is one of
the most commonly used feature generation methods.
tf-idf: In the third approach, we use the count of a particular word (feature) in an email as the
feature value and then use the tf-idf representation to weight each feature.
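A sketch of the second and third feature schemes using scikit-learn vectorizers; the 3000-word cap follows the description above, while the sample emails and variable names are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

emails = ["win a free prize now", "project meeting moved to 3pm", "lunch tomorrow?"]

# Second approach: binary presence of the 3000 most frequent words in the corpus
presence_vec = CountVectorizer(max_features=3000, binary=True)
X_presence = presence_vec.fit_transform(emails)

# Third approach: per-email word counts reweighted with tf-idf
tfidf_vec = TfidfVectorizer(max_features=3000)
X_tfidf = tfidf_vec.fit_transform(emails)
```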
Once the dataset is preprocessed and the features have been generated, we perform the
following operations for each approach:
• Find the performance of default classifiers on the test data
• Find the performance of the same classifiers after feature selection and hyperparameter
tuning on the test data
Finally, we compare the results obtained. We do this for the Parakweet, Gmail and Gmail + Enron
datasets.
Feature extraction: We use the scikit-learn implementation of SelectKBest to find the K best features for
each model. SelectKBest is a filter method that scores features by their statistical dependency
with the label and keeps the top K. We find the K best features for different values of K for each
classifier and use the K that gives the best accuracy to generate a good subset of features,
thus removing noise and improving each classifier's performance.
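The SelectKBest step might look as follows, assuming a chi-squared score function on non-negative count/tf-idf features (the text does not state which scoring function was used):

```python
from sklearn.feature_selection import SelectKBest, chi2

# Try several K values and keep the one giving the best downstream test accuracy
for k in (200, 500, 1000, 2000):
    selector = SelectKBest(score_func=chi2, k=k)
    X_train_k = selector.fit_transform(X_train, y_train)   # fit on training data only
    X_test_k = selector.transform(X_test)                   # same features kept for test
    # X_train_k / X_test_k are then fed to each classifier being tuned
```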
Hyperparameter Tuning: We use scikit-learn's GridSearchCV for tuning the hyperparameters of
our models. For each combination of hyperparameters we perform 5-fold cross validation to reduce
generalization error. The final range of values for each parameter was found
experimentally: we started with a wide range of values for each parameter, then narrowed the
range by checking the accuracy of the model at different values within it, and finally selected the
sub-range where the model gave high accuracy.
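A sketch of the tuning loop with GridSearchCV; the grid shown only echoes the C range reported in the experiments, and grids for the other models (n_neighbors, max_features, max_depth) would be set up analogously.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# 5-fold cross-validated grid search over the regularization strength C
param_grid = {"C": np.arange(0.1, 2.05, 0.1)}
search = GridSearchCV(LinearSVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)                 # training split from the 80:20 division
print(search.best_params_, search.best_score_)
```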
Deep Learning Model
The framework we use to experiment with Deep Learning is Keras, with TensorFlow as the backend.
The deep learning model that we chose to experiment with is a Convolutional Neural Network. CNNs
are basically several layers of convolutions with nonlinear activation functions like ReLU or tanh
applied to the results. In a traditional feedforward neural network we connect each input neuron to
each output neuron in the next layer. In CNNs we don’t do that. Instead, we use convolutions over the
input layer to compute the output.

For our experimental setup, our implementation is based on [11]. A sentence is represented as the
concatenation of the word vectors corresponding to the words of the sentence, padding the
sentence with zeroes as necessary in order to have a fixed-size input. The word vectors used
are from GloVe. Specifically, we used the 100-dimensional GloVe embeddings of 400k words
computed on a 2014 dump of English Wikipedia. An overview of the approach we followed:

1) Convert all text samples in the dataset into sequences of word indices. A "word index" would
simply be an integer ID for the word. We only consider the top 10,000 most commonly occurring
words in the dataset, and truncate the sequences to a maximum length of 400 words.
2) Prepare an "embedding matrix" which will contain at index i the embedding vector for the word of
index i in our word index. Then, load this embedding matrix into a Keras Embedding layer.
3) Build on top of it a 1D convolutional neural network, ending in a sigmoid output over 2 categories.

The model was run over all the datasets mentioned.
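A compact Keras sketch of steps 1-3, broadly following the setup above (top 10,000 words, sequences padded/truncated to 400 tokens, frozen 100-dimensional GloVe vectors, stacked 1D convolutions); the layer sizes, the GloVe file path, the `texts` variable and the softmax head used here for concreteness are assumptions rather than the exact architecture we ran.

```python
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

MAX_WORDS, MAX_LEN, EMB_DIM = 10000, 400, 100

# 1) Text -> sequences of word indices, padded/truncated to a fixed length
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(texts)                              # texts: list of email strings (placeholder)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)

# 2) Embedding matrix: row i holds the GloVe vector of the word with index i
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:     # assumed path to the GloVe file
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")
embedding_matrix = np.zeros((MAX_WORDS, EMB_DIM))
for word, i in tokenizer.word_index.items():
    if i < MAX_WORDS and word in glove:
        embedding_matrix[i] = glove[word]

# 3) 1D CNN on top of the frozen embedding layer, ending in a 2-way output
model = Sequential([
    Embedding(MAX_WORDS, EMB_DIM, weights=[embedding_matrix],
              input_length=MAX_LEN, trainable=False),
    Conv1D(128, 5, activation="relu"),
    MaxPooling1D(5),
    Conv1D(128, 5, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(128, activation="relu"),
    Dense(2, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```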


5 Experiments and Results
Ham Classifier

Performance on Parakweet Intent Dataset


The hyperparameter ranges explored were:
• K in SelectKBest: 200, 300, 400, 500, 600, 1000, 1500, 2000
• C in LinearSVC: 0.1 to 2.0 in steps of 0.1
• K in KNN: 1 to 10 in steps of 1
• Max features (Random Forest): Total_Features, Total_Features/2, Total_Features/3, Total_Features/4, log2
• Max depth (Random Forest): 40 to 80 in steps of 5
The models using the second approach gave about 5% higher accuracy on average compared
to the models using the first approach. Overall, the models that used tf-idf features outperformed the
other two approaches with an average accuracy of 79%. Considering the individual models,
LinearSVC was found to perform better than KNN and Random Forest for this task.
Aditya’s Gmail Dataset
The previous classification problem was performed on a public corpus. To see how the
classifiers fare on personal data, we run them on Aditya's Gmail dataset. The following are the
results observed.
The hyperparameter ranges explored were:
• K in SelectKBest: 200, 400, 800, 1000, 1200, 1500, 1800, 2000, 2200, 2536, 3000
• C in LinearSVC: 0.1 to 2.0 in steps of 0.1
• K in KNN: 1 to 10 in steps of 1
• Max features (Random Forest): Total_Features, Total_Features/2, Total_Features/3, Total_Features/4
• Max depth (Random Forest): 40 to 80 in steps of 5
The performance remains unchanged after feature selection because the best value for K in
SelectKBest was the same as the initial number of features; the same holds for the second approach.
Further, after hyperparameter tuning, the performance of KNN and SVC is found to decrease. This is
because, during hyperparameter selection, we perform 5-fold cross validation to decrease
generalization error: the parameter values obtained differ from the defaults and are chosen to
generalize well to unseen data.
Spam Classifier

Aditya’s Gmail Dataset (ham) + Enron Spam data (spam)


The hyperparameter ranges explored were:
• K in SelectKBest: 100 to 3000 in steps of 100
• C in LinearSVC: 0.1 to 2.0 in steps of 0.1
• K in KNN: 1 to 10 in steps of 1
• Max features (Random Forest): 50 to 200 in steps of 50
• Max depth (Random Forest): 40 to 80 in steps of 5
All the models are found to perform relatively well on spam classification, with LinearSVC using tf-idf
giving the best performance at 97.64%. The CNN was also run on all three datasets; its results are
summarized in the discussion below.

6 Discussion and Conclusion


For the Parakweet dataset, the best performing model was a linear SVM model using tf-idf
features, with an accuracy of 81.41%.
The next choice was to pick a combination of two models that would let us evaluate our system as a
whole on the three-class problem of categorizing a mail as Spam, Important or Not Important. This is
handled as two separate classification problems that follow each other in a pipeline, as sketched below.
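A rough sketch of the resulting pipeline and how its net accuracy is measured; spam_clf and ham_clf stand for the chosen tuned models, and passing the same feature vector to both stages is a simplification (in practice each classifier has its own vectorizer).

```python
def prioritize(x, spam_clf, ham_clf):
    """Stage 1: spam/ham filtering. Stage 2: importance ranking of ham emails."""
    if spam_clf.predict([x])[0] == 1:       # 1 = spam
        return "Spam"
    if ham_clf.predict([x])[0] == 1:        # 1 = important
        return "Important"
    return "Not Important"

# Net accuracy over the held-out 20%, against three-way gold labels
preds = [prioritize(x, spam_clf, ham_clf) for x in X_test]
net_accuracy = sum(p == t for p, t in zip(preds, y_test_3way)) / len(y_test_3way)
```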
Based on our experiments we found that for the Spam/Ham Classification problem, the linear SVM
model and the CNN model performed the best with accuracies of 97.48% and 97.93% respectively.
For the Important/Not Important classification problem, the best performing classifiers were a Linear
SVM with 79.94% and a Random Forest Classifier with 80.01%.
For the first choice of models, we decided to go with the linear SVM model because of the fast
training time and low volume of data we possessed. We believe that with a much larger volume of
data and better tuning of the CNN model, the CNN could perform better, but for the current dataset
the SVM is a better choice considering space and time tradeoffs, especially for a
personalized system that requires frequent retraining. For the second choice, we again chose the
linear SVM model because of its superior performance over the Random Forest Classifier on the
Parakweet dataset. We believe that it would generalize better.
The net accuracy of the resulting system over an 80/20 overall split turned out to be 86.83%. The
confusion matrix for the system shows that the worst performing class is
‘Important’, with the largest error coming from it being misclassified as ‘Not Important’. In terms of
future work, we would like to proceed by collecting more labelled data and tuning the parameters of
the CNN model to achieve a better accuracy.

References
[1] Jackson, Tom, and Emma Russell. "Four email problems that even titans of tech haven't resolved." The
Conversation (2015).
[2] T. Jackson. What to do with Employees that are too busy to manage their Email? Available online at
http://www.proni.gov.uk/what to do with employees that are too busy to manage their email tom jackson.pdf
[3] Youn, Seongwook, and Dennis McLeod. "A comparative study for email classification." Advances and
innovations in systems, computing sciences and software engineering. Springer Netherlands, 2007. 387-391.
[4] Awad, W. A., and S. M. ELseuofi. "Machine Learning methods for E-mail Classification." International
Journal of Computer Applications 16.1 (2011).
[5] Tretyakov, Konstantin. "Machine learning techniques in spam filtering." Data Mining Problem-oriented
Seminar, MTAT. Vol. 3. No. 177. 2004.
[6] Guzella, Thiago S., and Walmir M. Caminhas. "A review of machine learning approaches to spam filtering."
Expert Systems with Applications 36.7 (2009): 10206-10222.
[7] http://www.jatit.org/volumes/Vol37No2/12Vol37No2.pdf
[8] Corston-Oliver, Simon H., et al. "Integration of email and task lists." (2004).
[9] Dredze, Mark, John Blitzer, and Fernando Pereira. "Reply Expectation Prediction for Email Management."
CEAS. 2005.
[10] https://github.com/ParakweetLabs/EmailIntentDataSet
[11] Kim, Yoon. "Convolutional Neural Networks for Sentence Classification." EMNLP 2014.
