
CHAPTER 1

INTRODUCTION

1.1 OVERVIEW
Finding compromised machines and spammers on the Internet is a challenging and complicated task. Compromised machines, commonly known as spam zombies, are increasingly used to mount various security attacks such as spreading malware and spamming. Attackers can recruit a large number of compromised machines in a network through spamming activity. Email spam is unsolicited, anonymous email sent to a large group of users. The main focus of the proposed system is the detection of users involved in spamming activity and the detection of email attachments carrying viruses. A spam zombie is a compromised machine that is involved in spamming activity. Spammers mount various security attacks such as capturing users' secret data, click fraud, and phishing, so it is necessary to identify and block such spammers in a network. The proposed system detects and blocks spammers in a network. Existing spammer detection methods detect spammers in a social network, whereas our aim is to help system administrators detect the spammers in their own networks. The system deletes emails whose attachments contain virus files. To reactivate the email account, the user must pass a test. The system uses the SPOT detection algorithm to detect spammers. The proposed system assists system administrators in automatically identifying spammers in an online manner, and it also helps in identifying emails carrying viruses. The SPOT detection algorithm is based on a statistical tool known as the Sequential Probability Ratio Test (SPRT), which Wald designed in his seminal work.

1.2 AIM & OBJECTIVE


The goal of SPRT is to decide which of two hypotheses is correct based on two threshold values. The main advantage of SPRT is that it requires a small number of observations to reach a decision for a given error rate. Spammers can also be detected using other methods such as PT (Percentage Threshold) and CT (Count Threshold). The PT algorithm calculates the percentage of spam messages sent by the same user in one time window, and the CT algorithm counts the number of spam messages sent by the same user in one time window. If the calculated percentage in PT or the spam count in CT crosses the predefined threshold value, that user is flagged as a spammer. The CT and PT algorithms have two disadvantages: first, selecting the "right" values for their input parameters is a challenging and tricky task; second, the performance of PT and CT is sensitive to these input parameters. SPOT is another spammer detection algorithm, which detects spammers in an online manner. The SPOT algorithm analyses the total number of messages sent by the machine and the user rather than only the rate at which they are sent.
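As a rough illustration of the PT and CT ideas, the following is a minimal Python sketch, not the thesis implementation; the parameter names count_threshold, percentage_threshold and min_messages, and the toy data, are assumptions made for illustration only.

from collections import defaultdict

def detect_with_ct_pt(messages, count_threshold=10, percentage_threshold=0.5,
                      min_messages=5):
    """Flag senders using simple Count-Threshold (CT) and
    Percentage-Threshold (PT) rules over one time window.

    `messages` is an iterable of (sender, is_spam) pairs observed in the
    window; `is_spam` is the verdict of an upstream content filter.
    """
    totals = defaultdict(int)   # messages per sender
    spam = defaultdict(int)     # spam messages per sender

    for sender, is_spam in messages:
        totals[sender] += 1
        if is_spam:
            spam[sender] += 1

    # CT: flag senders whose spam count reaches the count threshold.
    ct_spammers = {s for s, n in spam.items() if n >= count_threshold}
    # PT: flag senders whose spam percentage reaches the percentage threshold.
    pt_spammers = {
        s for s in totals
        if totals[s] >= min_messages and spam[s] / totals[s] >= percentage_threshold
    }
    return ct_spammers, pt_spammers


if __name__ == "__main__":
    window = [("hostA", True)] * 12 + [("hostB", False)] * 8 + [("hostB", True)] * 2
    print(detect_with_ct_pt(window))

Both rules depend directly on the chosen thresholds, which is exactly the sensitivity to input parameters noted above.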

1.3 PROBLEM DESCRIPTION


Previous research has focused on both machine learning and non-machine learning approaches to spam detection. The machine learning approach includes unified model filtering and ensemble filtering to classify received emails as spam or non-spam. Previous work on machine learning techniques has used Artificial Neural Networks (ANN), k-nearest neighbour, Naive Bayes, and Support Vector Machines (SVM). A simple content-based spam filtering technique can be used, because our main focus is on spammer detection. The existing approaches for spammer detection are given below.
Detection of spammers in social networks
A system was developed to detect spammers and fake profiles in social networks. The authors designed an algorithm based on the social network's topology: based on anomalies in the topology of the social network, the algorithm detects users who connect to others at random.
Spammer detection
A system was developed to detect review spammers of an online store based on a review graph. The system detects review spammers using an iterative method and a review graph model. Its advantage is that it uses the relationships between the review, the reviewer, and the store instead of only the behavioural characteristics of the reviewer, and it shows how the information in the review graph indicates and reveals the causes of spamming. Another work proposed a model that effectively integrates both content information and social network information to detect spammers in microblogging, and designed a new framework for Social Spammer Detection in Microblogging (SSDM). To model the refined social networks, the framework employs a directed Laplacian formulation and then integrates the network information into a sparse supervised formulation for modeling the content information. Previous work on spammer detection focuses on aggregate characteristics of spammers, but such systems are not suitable for a network where the number of generated messages is small.
1.4 PROBLEM STATEMENT
A method was proposed for detecting spammers on mail. The authors constructed a large labeled collection of users by manually classifying them into spammers and non-spammers, and then identified a number of characteristics related to tweet content and user social behaviour to detect spammers. For classifying users as either spammers or non-spammers, they used these characteristics as attributes of a machine learning process. In this approach, each user is represented by a vector of values, one for each attribute. The algorithm learns a classification model from a set of previously labeled (i.e., pre-classified) data and then applies the acquired knowledge to classify new (unseen) users into two classes: spammers and non-spammers. However, this approach requires a large number of message views.

1.5 THESIS CONTRIBUTION
The proposed system aims to delete emails with virus-infected attachments. It detects and blocks spammers using the SPOT detection algorithm, and it provides an account reactivation test. The system receives an email message and then checks the attachment for viruses. Emails whose attachments contain virus files are deleted. If no virus is found, the message is checked for spam, and the system applies the SPOT detection algorithm to detect spammers. The contribution consists of three parts:
1) Virus check
2) Spam check and spam filter
3) Blocking of spammers using SPOT, and recovery

1.6 RESEARCH DESIGN

This section first describes the various kinds of data that can be analysed from e-mail traffic and the levels of privacy involved. Secondly, it gives a brief overview of link analysis techniques that can be applied to network security. Our approaches are then explained in detail, and results of their experimental evaluation are presented. The issue addressed is identifying the machines that are sending spam, or machines that have been compromised and are being used as spam relays. Note that our focus is not on identifying individual users who send spam, or on filtering an e-mail as spam based on its content; there has been work in these areas, but it is not directly related to ours. Recent work on the detection of spam Trojans suggests the use of signature- and behaviour-based techniques. Building on this line of work, SpamMail was proposed as a new approach to ranking and classifying emails according to the address of the email sender. The central procedure is to collect data about trusted email addresses from different sources and to create a graph of the social network derived from each user's communication circle.

There are two SpamMail variants, both of which apply a power-iteration algorithm to the email network graph: Basic SpamMail produces a global reputation for each known email address, and Personalized SpamMail computes a personalized trust value. SpamMail allows email addresses to be classified into 'spammer address' and 'non-spammer address', and additionally determines the relative rank of an email address with respect to other email addresses. The performance of SpamMail was also analysed under several scenarios, including sparse networks, showing its resilience against spammer attacks. The authors investigated the feasibility of SpamMail, a new email ranking and classification scheme that intelligently exploits the social communication network created via email interactions. On the resulting email network graph, a power-iteration algorithm is used to rank trustworthy senders and to detect spammers. SpamMail performs well even in the presence of very sparse networks: even with a low participation rate, it can effectively distinguish between spammer and non-spammer email addresses, including those of users who do not participate actively. SpamMail is also very resistant to spammer attacks and, in fact, has the property that when more spammer email addresses are introduced into the system, its performance increases.
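The following is a minimal Python sketch in the spirit of the power-iteration ranking described above; the damping factor, the iteration count and the toy graph are illustrative assumptions, not the actual SpamMail algorithm.

def reputation_scores(edges, damping=0.85, iterations=50):
    """Power iteration over a directed email graph.

    `edges` maps each sender address to the list of addresses it has
    mailed; trust flows along edges, so addresses mailed by many trusted
    senders accumulate a higher global reputation.
    """
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    score = {n: 1.0 / len(nodes) for n in nodes}

    for _ in range(iterations):
        new_score = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for sender, targets in edges.items():
            if targets:
                share = damping * score[sender] / len(targets)
                for t in targets:
                    new_score[t] += share
        score = new_score
    return score


if __name__ == "__main__":
    # Hypothetical addresses for illustration only.
    graph = {
        "alice@a.org":  ["bob@b.org", "carol@c.org"],
        "bob@b.org":    ["carol@c.org"],
        "spammer@x.ru": ["alice@a.org", "bob@b.org", "carol@c.org"],
    }
    for addr, s in sorted(reputation_scores(graph).items(), key=lambda kv: -kv[1]):
        print(f"{addr:15s} {s:.3f}")

In this toy graph, the address that only sends mail and is never mailed by trusted senders ends up with the lowest score, mirroring the intuition behind ranking trustworthy senders on the email network graph.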

1.7 CHALLENGES OF THESIS


A major security challenge on the Internet is the existence of a large number of compromised machines. Such machines have been increasingly used to launch various security attacks including DDoS, spamming, and identity theft. Two characteristics of compromised machines on the Internet, their sheer volume and wide spread, render many existing security countermeasures less effective and make defending against attacks involving compromised machines extremely hard. At the same time, identifying and cleaning compromised machines in a network remains a significant challenge for system administrators of networks of all sizes.

This thesis focuses on the subset of compromised machines that are used for sending spam messages, which are commonly referred to as spam zombies. Given that spamming provides a critical economic incentive for the controllers of compromised machines to recruit them, it has been widely observed that many compromised machines are involved in spamming. A number of recent research efforts have studied the aggregate global characteristics of spamming botnets (networks of compromised machines involved in spamming), such as the size of botnets and their spamming patterns, based on sampled spam messages received at a large email service provider.

Rather than studying the aggregate global characteristics of spamming botnets, we aim to develop a tool for system administrators to automatically detect the compromised machines in their networks in an online manner. We consider ourselves situated in a network and ask the following question: how can we automatically identify the compromised machines in the network as outgoing messages pass the monitoring point sequentially? The approaches developed in previous work cannot be applied here. The locally generated outgoing messages in a network normally cannot provide the aggregate large-scale spam view required by these approaches. Moreover, these approaches cannot support the online detection requirement.

CHAPTER 2
LITERATURE SURVEY

2.1 BACKGROUND

This chapter focuses on studies that utilize spamming activities to detect bots. Based on email messages received at a large email service provider, two recent studies [2, 3] investigated the aggregate global characteristics of spamming botnets, including the size of botnets and their spamming patterns. These studies provided important insights into the aggregate global characteristics of spamming botnets by clustering spam messages received at the provider into spam campaigns, using embedded URLs and near-duplicate content clustering, respectively. However, their approaches are better suited for large email service providers to understand the aggregate global characteristics of spamming botnets than for deployment by individual networks to detect internal compromised machines. Moreover, their approaches cannot support the online detection requirement in the network environment considered in this thesis. We aim to develop a tool that assists system administrators in automatically detecting compromised machines in their networks in an online manner.
Xie et al. developed DBSpam, an effective tool to detect proxy-based spamming activities in a network by relying on the packet symmetry property of such activities [5]. We intend to identify all types of compromised machines involved in spamming, not only the spam proxies that translate and forward upstream non-SMTP packets (for example, HTTP) into SMTP commands to downstream mail servers, as in [5].
BotHunter [6], developed by Gu et al., detects compromised machines by correlating the IDS dialog trace in a network. It was developed based on the observation that a complete malware infection process has a number of well-defined stages, including inbound scanning, exploit usage, egg downloading, outbound bot coordination dialog, and outbound attack propagation. By correlating inbound intrusion alarms with outbound communication patterns, BotHunter can detect potentially infected machines in a network. Unlike BotHunter, which relies on the specifics of the malware infection process, SPOT focuses on the economic incentive behind many compromised machines and their involvement in spamming. Compared to BotHunter, SPOT is a lightweight spam zombie detection system; it does not need the support of a network intrusion detection system as required by BotHunter.
As a simple and powerful statistical method, Sequential Probability Ratio Test (SPRT)
has been successfully applied in many areas [7]. In the area of networking security, SPRT has
been used to detect portscan activities [8], proxy-based spamming activities [5], and MAC
protocol misbehavior in wireless networks [9].

2.1.1 EMAIL COMMUNICATION


Email is one of the most popular forms of communication today. The surprisingly fast acceptance of this communication medium is best exemplified by the sheer number of current users, estimated to be close to three quarters of a billion individuals and growing (IDG.net, 2012). This form of communication has the simple advantages of being almost instantaneous, intuitive to use, and costing virtually nothing per message.
The current email system is based on the SMTP protocol, RFC 821 and 822 (Postel, 1982; Crocker, 1982), developed in 1982 and extended in RFC 2821 in 2001 (Klensin, 2011). This system defined a common standard to unite the different messaging protocols in existence at the time. It allowed users to exchange messages with one another using a system based on the SMTP protocol and email addresses. These protocols allowed messages to flow from one user to another, making it practical and easy for different users to communicate independent of the service provider or the client application.

2.1.2 MESSAGE HANDLING


In most current email implementations, the user has almost no control over which messages enter and exit their email account. A limited amount of control is sometimes provided through the use of client-side filters or rapid clicking of the delete key. Many available filters are either hard-coded rules or simple pattern matchers that direct messages to specific folder destinations. These filters treat the problem as simple text classification and do not consider the overall picture of the user's usage of the email system. In addition, some of the filters operate on a rule-based system and need to be frequently updated with new rules to remain effective. Anecdotal evidence from the security domain has shown that this updating phase is the weakest link in the chain of protection.
2.1.3 MESSAGE ANALYSIS
When analyzing large sets of emails or attachments generated by a single user or group of users, the common approach is to treat the problem as if the data were one large email box. The most sophisticated analysis available is a count of the number of messages in a user-created sub-folder. Basic flat searches and name, date, and topic sorting are the most commonly available functions. In addition, current email clients have no analysis tools for quickly analyzing past messages or attachments within a user's email box. Profile views of the data for different tasks should be made available to the user, to enable them to understand a message in its historical context. For example, an automatic list of emails that have not received responses can be generated to show the user any open issues they might have in their email box.

2.1.4 PROTECTION
Email is a convenient medium for sharing files as attachments with other users in a group. Malicious attachments propagating viruses or worms are creating havoc with the email system and wasting email and IT resources. Current email service providers utilize one or more integrated anti-virus products to check for and identify malicious attachments. Most current anti-virus products work on the basis of signatures. A strong shortcoming of this approach is that it is based on known virus signatures and, as such, cannot detect unknown or new malicious attachments, i.e., it does not solve the zero-day virus problem. Substantial research on using heuristics and machine learning algorithms to learn virus patterns has been explored (Kolter and Maloof, 2014).
Threading Messages
Threading messages has been used for years by newsgroup readers as a way of organizing message topics. Threads are usually based on linking subject lines or on the message 'reply-to' id in the email header field. Recent work by Venolia and Neustaedter (2013) and Kerr (2003) on visualizing conversation threads offers excellent propositions, once an important or relevant email has been located. If the user has a few hundred messages sitting in the INBOX, without priority reorganization, picking out the start or middle of an interesting thread is not an easy task.
2.2 EMAIL CLASSIFICATION
One way to help the user organize email is to have the email client automatically
either discard or move messages into specific folders for the user's convenience. One of the
earliest systems (Pollock, 1988), called ISCREEN, had a rich set of rules and policies to
allow the user to create rule sets to handle incoming emails. Ishmail (Helfman and Isbell,
1995) helped organize messages by also providing summaries to the user on the status of what and where groups of new messages were being moved.

Figure 2.1: SPAM Analysis

2.2.1 DEFINING SPAM


Email today is not a permission-based service, yet one may observe and model the
individual user's behavior to calculate a prediction of how a user would treat a specific
message. Computer algorithms can learn what types of emails the user opens and reads, and
those which he/she immediately discards. For example, electronic bills or annoying forwards
from friends might be unwanted, but they are not spam if the user reads them and sometimes
responds. Those emails have a clear source marked on the email, and a relatively simple filter rule will stop them from recurring if the user so desires (e.g., block forwards from user X).
For the day-to-day usage of email the biggest challenge facing users is recognizing
and dealing with misuse and abuse of emails. In general this means dealing with unwanted
messages, which can quickly grow and overwhelm most users. To deal with this problem,
many systems are implementing spam filters to automatically move all spam messages to
special spam folders.
Protocols
The next solution that has been proposed calls for the overhaul of the entire email
system transforming it into a permission-based system. Making the assumption that a new
protocol might solve the problem once and for all, designers have suggested multiple ways to
fix all the security concerns and authentication mechanisms of the current system. The first
problem is that the current open email system is very much entrenched, making it
unrealistically hard to implement a new protocol. Second, deploying a new system across the
entire Internet in the foreseeable future is a very hard task.

2.3 SPAM FILTERING


The last approach tries to filter out spam email messages from the user's email box by
identifying which messages are likely to be spam and which are not. There are three popular
methods for filtering out spam: white lists, black lists, and content-based filtering, as well as various methods combining all three.

2.3.1 MACHINE LEARNING MODELS


The second approach uses machine-learning models, leveraging work done on text
classification and natural language processing applied directly to spam. A training set of
emails is created with both normal and spam emails, and a machine learning technique is
chosen to classify the emails. For performance reasons, the usual method is not to have an
online classifier, but rather a preset offline classifier, which automatically classifies the
emails for the user behind the scenes based on a static model.
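A minimal sketch of such an offline, content-based classifier, assuming scikit-learn is available; the tiny training corpus below is invented purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

training_emails = [
    "free money claim your prize now",
    "cheap meds best offer click here",
    "meeting rescheduled to monday at ten",
    "please review the attached project report",
]
training_labels = ["spam", "spam", "normal", "normal"]

# Bag-of-words features fed into a multinomial Naive Bayes classifier,
# trained once (offline) and then applied to new mail behind the scenes.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(training_emails, training_labels)

print(model.predict(["claim your free prize today",
                     "agenda for monday meeting attached"]))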
Figure 2.2: Content Based Filtering

2.3.2 CONTENT-BASED FILTERING


Content-based filtering has been shown to be very accurate, with some experiments claiming accuracy as high as 98.8% for certain data sets (Graham, 2012; Yerazunis, 2014; Siefkes et al., 2014). These numbers do not reflect two very important factors of the real world driven by the economics enjoyed by spam senders.

2.3.3 NON CONTENT BASED FILTERING


There has been some reported work on finding groups of email users, as discussed in our work and others (Stolfo et al., 2013b; Tyler et al., 2013). ContactMap (Nardi et al., 2012) is an application that organizes a visual representation of groups of users, but it does not consider the case of emails that may violate group behavior.

2.3.4 TRAINING AND TESTING DATA


In order to create an algorithm for this, the program must be taught what a spam email looks like (and what non-spam emails look like). Fortunately, the previous emails that have been marked as spam by users are available for this purpose. A way to test the accuracy of the spam filter is also needed. One idea would be to test it on the same data used for training; however, this leads to a major problem in machine learning called overfitting, which means that the model becomes too biased towards the training data and may not work as well on examples outside of that training set. One common way to avoid this is to split the labeled data 70/30 for training and testing, which ensures that testing uses different data than training. It is important to note that both spam and non-spam data are needed in the data sets, not just the spam ones, and the training data should be as similar to real email data as possible.

2.4 CLASSIFICATION MODELS


This section first briefly overviews the theory behind machine learning modeling and then steps through each of the specific classification and behaviour models presented in the thesis. Before continuing, we define some common terms used in the text. Features, or attributes, are the alphabet of the language we are mathematically modeling. A set of attributes describes an instance that we would like to label; for example, when modeling the body of an email, a typical feature would be a word in the body of the message. These individual features are sometimes referred to as tokens.

Figure 2.3: Classification models


Target Function - or class label is the pattern we are trying to learn. For example, in the
spam detection task, given an unknown email, we would like to predict with some degree of
confidence whether it is spam or not. In this case, "is it spam?" is the target function.

False Positive Rate - is the percentage of non-target examples that our model has misidentified as the target concept. Generally our goal is to minimize this measurement without increasing the error rate. Generally the cost associated with false positives is higher than that of false negatives. The false positive rate is computed as:

FP rate = (# of examples misidentified as target) / (total # of non-target examples)

False Negative Rate - is the proportion of target instances that were erroneously reported as non-target. When tuning the detection algorithm we must find a balance between false negatives and false positives: a threshold is applied over all examples, and the higher this threshold, the more false negatives and the fewer false positives. The false negative rate is computed as:

FN rate = (# of target examples misidentified as non-target) / (total # of target examples)
Sample Error Rate - is the number of training examples that the model has misclassified divided by the total number of examples seen. This is one measure used to estimate how well the classifier has learned the target function.

True Error Rate - is the probability that the model will misclassify an example, given a specific training sample and sample error rate. This quantity is hard to measure accurately, but it can be approximated if the training set closely resembles the true distribution of future examples. In other words, if we train on half spam and half non-spam examples but in reality 90% of examples will be spam, the sample error will not be an accurate measurement of the model's error rate.
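The two rates defined above can be computed directly from a list of predictions and ground-truth labels; the following Python sketch is illustrative only.

def error_rates(predictions, truths):
    """Compute false positive and false negative rates for a binary
    spam/non-spam task, following the two formulas above.

    `predictions` and `truths` are parallel lists of booleans where True
    means "spam" (the target concept).
    """
    fp = sum(1 for p, t in zip(predictions, truths) if p and not t)
    fn = sum(1 for p, t in zip(predictions, truths) if not p and t)
    non_targets = sum(1 for t in truths if not t)
    targets = sum(1 for t in truths if t)

    fp_rate = fp / non_targets if non_targets else 0.0
    fn_rate = fn / targets if targets else 0.0
    return fp_rate, fn_rate


if __name__ == "__main__":
    preds  = [True, True, False, False, True]
    truths = [True, False, True, False, True]
    print(error_rates(preds, truths))   # (0.5, 0.333...)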

2.5 RELATED PAPER


Title 1:A hybrid approach for spam detection for Social network

Author: Malik Mateen, Muhammad Aleem

Year:2017

Description

 The increasing popularity of social networking sites allows an enormous amount of data and information to be gathered about users, their relationships, friends and family.
 Online social networks (OSNs) are becoming extremely popular among Internet users, who spend a significant amount of time on popular social networking sites such as Facebook, Twitter and Google+.
 A hybrid technique that uses content-based as well as graph-based features is proposed for the identification of spammers on the Twitter platform.
Algorithm

 Classification Algorithm

Techniques

 Hybrid Techniques.

Title 2: NetSpam: A Network-Based Spam Detection Framework for Reviews in Online


Social Media.
Author: Saeedreza Shehnepoor, Mostafa Salehi, Reza Farahbakhsh.
Year:2017
Description
 A large fraction of people rely on available content in social media in their decisions (e.g., reviews and feedback on a topic or product).
 Online social media portals play an influential role in information propagation, which is considered an important source for producers in their advertising campaigns as well as for customers in selecting products and services.
Advantages
 It gives better performance in spotting spam reviews in both semi-supervised and unsupervised approaches.
 It also gives better accuracy with lower time complexity.
 It makes detecting fake reviews easier, with less time consumption.
Disadvantages
 The problem of spotting spammers and spam reviews is non-trivial and challenging.
Algorithm
 Graph Based Algorithm
 Filtering Algorithm
Title 3: Exploiting Burstiness in Reviews for Review Spammer Detection
Author: Geli Fei, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, Riddhiman Ghosh
Year:2011

Description
 Online product reviews have become an important source of user opinions.
 Due to profit or fame, imposters have been writing deceptive or fake reviews to
promote and/or to demote some target products or services. Such imposters are called
review spammers.
 Reviewers and reviews appearing in a burst are often related, in the sense that spammers tend to work with other spammers and genuine reviewers tend to appear together with other genuine reviewers.
 A novel evaluation method is proposed to evaluate the detected spammers automatically using supervised classification of their reviews.

Advantages
 To identify spam and spammers as well as different type of analysis on this topic.
 Using a strong prior such as RAVP and a local observation OSI will help the belief
propagation to converge to a more accurate solution in less time.
Disadvantages
 It is hard for anyone to know a large number of signals without extensive experience
in opinion spam detection.
 It is more difficult for people to make well-informed buying decisions without being
deceived by fake reviews.
Technique
 Kernel Density Estimation (KDE) technique
Algorithm
 Loopy Belief Propagation algorithm
Title 4: Spreading Processes in Multilayer Networks
Author: Mostafa Salehi, Rajesh Sharma, Moreno Marzolla, Matteo Magnani, Payam Siyari, and Danilo Montesi
Year:2015

Description
 Several systems can be modeled as sets of interconnected networks or networks with
multiple types of connections, here generally called multilayer networks.
 Spreading processes in multilayer networks is an active and not yet consolidated
research field and offers many unsolved problems to address.
 Collecting real datasets related to a multilayer network is nontrivial; the issue is even more challenging when one tries to gather data on both the spreading process and the structure of the underlying multilayer network.
 Network sampling strategies can be used to address this issue by decreasing the
expense of processing large real networks.

Advantages
 It is convenient to 'flatten' adjacency tensors into matrices, called 'supra-adjacency matrices', for computations.
Disadvantages
 When both layers have the same average degree, the epidemic threshold increases for a larger difference between intra- and inter-layer infection rates, as it becomes more difficult to spread to other layers.
Technique
 Generating function technique
 Outbreak detection technique
Algorithm
 Epidemic routing algorithm
Title 5 : Trust-Aware Review Spam Detection
Author: Hao Xue, Fengjun Li, Hyunjin Seo and Roseann Pluretti
Year: 2015

Description
 Online review systems play an important role in affecting consumers’ behaviors and
decision making, attracting many spammers to insert fake reviews to manipulate
review content and ratings.
 To increase utility and improve user experience, some online review systems allow
users to form social relationships between each other and encourage their interactions.
 A trust-based prediction achieves a higher accuracy than standard CF method.
 There exists a strong correlation between social relationships and the overall
trustworthiness scores.
Advantages
 The crucial goal of opinion spam detection in the review framework is to identify
every fake review and fake reviewer.
 For fast and effective manipulation, spammers may control a large number of
accounts or work in groups to insert bogus reviews in a short period of time.
Disadvantages
 It is difficult for the CF model to achieve the expected accuracy.
 The application of text classification in semantic extraction and feature selection is
limited because of the low training speed.
Algorithm
 Collaborative filtering algorithm
Title 7 : Detecting Product Review Spammers using Rating Behaviors
Author:Ee-Peng Lim, Viet-An Nguyen, Nitin Jindal, Bing Liu and Hady W. Lauw
Year: 2010

Description
 It detects users generating spam reviews or review spammers and identifies several
characteristic behaviors of review spammers and models these behaviors so as to
detect the spammers.
 Spammers may target specific products or product groups in order to maximize their
impact.
 Detecting review spam is a challenging task as no one knows exactly the amount of
spam in existence.
 The state-of-the-art approach to review spam detection is to treat the reviews as the
target of detection.
Advantages
 It focuses on review centric spam identification which provides greater focus on
feedback content.
 A review spam is harder to detect.
Disadvantages
 Spam reviews concentrate on the information provided on the product page, and they are more difficult to read than truthful reviews.
Technique
 Classification technique
CHAPTER 3
PROBLEM FORMATION

Figure 3.1 shows the logical view of the network model. We assume that messages originating from machines inside the network will pass the deployed spam zombie detection system. This assumption can be achieved in a few different scenarios. First, in order to alleviate the ever-increasing
spam volume on the Internet, many ISPs and networks have adopted the policy that all the
outgoing messages originated from the network must be relayed by a few designated mail
servers in the network. Outgoing email traffic (with destination port number of 25) from all
other machines in the network is blocked by edge routers of the network. In this situation, the
detection system can be co-located with the designated mail servers in order to examine the
outgoing messages. Second, in a network where the aforementioned blocking policy is not
adopted, the outgoing email traffic can be replicated and redirected to the spam zombie
detection system. We note that the detection system does not need to be on the regular email
traffic forwarding path; the system only needs a replicated stream of the outgoing email
traffic. Moreover, as we will show in Section 6, the proposed SPOT system works well even
if it cannot observe all outgoing messages. SPOT only requires a reasonably sufficient view
of the outgoing messages originated from the network in which it is deployed.

Figure 3.1: Network model.

A machine in the network is assumed to be either compromised or normal (that is, not compromised). In this thesis we focus only on the compromised machines that are involved in spamming; therefore, we use the term compromised machine to denote a spam zombie, and use the two terms interchangeably. Let Xi, for i = 1, 2, ..., n, denote the successive observations of a random variable X corresponding to the sequence of messages originating from a machine inside the network.

3.1 BACKGROUND ON SEQUENTIAL PROBABILITY RATIO TEST


In its simplest form, SPRT is a statistical method for testing a simple null hypothesis
against a single alternative hypothesis. Intuitively, SPRT can be considered as a one-
dimensional random walk with two user-specified boundaries corresponding to the two
hypotheses. As the samples of the concerned random variable arrive sequentially, the walk
moves either upward or downward one step, depending on the value of the observed sample.
When the walk hits or crosses either of the boundaries for the first time, the walk terminates
and the corresponding hypothesis is selected. In essence, SPRT is a variant of the traditional
probability ratio tests for testing under what distribution (or with what distribution
parameters), it is more likely to have the observed samples. However, unlike traditional
probability ratio tests that require a pre-defined number of observations, SPRT works in an
online manner and updates as samples arrive sequentially. Once sufficient evidence for
drawing a conclusion is obtained, SPRT terminates.

As a simple and powerful statistical tool, SPRT has a number of compelling and desirable features that have led to its wide-spread application in many areas. First, both the actual false positive and false negative probabilities of SPRT can be bounded by user-specified error rates. This means that users of SPRT can pre-specify the desired error rates; a smaller error rate tends to require a larger number of observations before SPRT terminates, so users can balance the performance and cost of an SPRT test. Second, it has been proved that SPRT minimizes the average number of observations required to reach a decision for a given error rate, among all sequential and non-sequential statistical tests. This means that SPRT can quickly reach a conclusion and reduce the cost of the corresponding experiment, without incurring a higher error rate. In the following we present the formal definition and a number of important properties of SPRT.
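A minimal Python sketch of SPRT for a Bernoulli random variable, in the spirit of the description above; the hypothesis parameters theta0 and theta1 and the error rates alpha and beta are illustrative values, not those used by SPOT.

import math

class SPRT:
    """Sequential Probability Ratio Test for a Bernoulli random variable.

    H1: the machine is compromised, P(message is spam) = theta1.
    H0: the machine is normal,      P(message is spam) = theta0.
    alpha/beta are the desired false positive / false negative rates.
    """

    def __init__(self, theta0=0.2, theta1=0.8, alpha=0.01, beta=0.01):
        self.low = math.log(beta / (1.0 - alpha))             # accept-H0 boundary
        self.high = math.log((1.0 - beta) / alpha)            # accept-H1 boundary
        self.step_spam = math.log(theta1 / theta0)            # walk up on a spam message
        self.step_ham = math.log((1 - theta1) / (1 - theta0)) # walk down otherwise
        self.llr = 0.0                                        # log-likelihood ratio so far

    def observe(self, is_spam):
        """Update the walk with one observation; return a decision or None."""
        self.llr += self.step_spam if is_spam else self.step_ham
        if self.llr >= self.high:
            return "compromised"
        if self.llr <= self.low:
            return "normal"
        return None


if __name__ == "__main__":
    test = SPRT()
    for verdict in (True, True, False, True, True, True):  # spam verdicts per message
        decision = test.observe(verdict)
        if decision:
            print("decision:", decision)
            break

Each spam observation moves the walk up by log(theta1/theta0) and each non-spam observation moves it down; the walk stops as soon as it crosses either user-specified boundary, which is the online behaviour described above.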

3.2 SPAM NETWORK METHOD

The spam network method (SNM) is based on artificial neural networks, or connectionist systems: computing systems vaguely inspired by the biological neural networks that constitute animal brains. Such systems "learn" to perform tasks by considering examples, generally without being programmed with any task-specific rules. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using the results to identify cats in other images. They do this without any prior knowledge about cats, e.g., that they have fur, tails, whiskers and cat-like faces. Instead, they automatically generate identifying characteristics from the learning material that they process.

An SNM is based on a collection of connected units or nodes called artificial neurons which
loosely model the neurons in a biological brain. Each connection, like the synapses in a
biological brain, can transmit a signal from one artificial neuron to another. An artificial
neuron that receives a signal can process it and then signal additional artificial neurons
connected to it.

In common SNM implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. The connections between artificial neurons are called 'edges'.
Artificial neurons and edges typically have a weight that adjusts as learning proceeds. The
weight increases or decreases the strength of the signal at a connection. Artificial neurons
may have a threshold such that the signal is only sent if the aggregate signal crosses that
threshold. Typically, artificial neurons are aggregated into layers. Different layers may
perform different kinds of transformations on their inputs. Signals travel from the first layer
(the input layer), to the last layer (the output layer), possibly after traversing the layers
multiple times.
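A minimal sketch of the units described above: each neuron applies a non-linear function to the weighted sum of its inputs, and the aggregate output only counts as "firing" if it crosses a threshold. The weights below are fixed illustrative values rather than learned ones.

import math

def neuron(inputs, weights, bias):
    """One artificial unit: a non-linear (sigmoid) function of the
    weighted sum of its inputs."""
    total = bias + sum(x * w for x, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-total))

def tiny_network(inputs):
    """Two hidden units feeding one output unit, arranged in layers."""
    h1 = neuron(inputs, [0.8, -0.4], bias=0.1)
    h2 = neuron(inputs, [-0.3, 0.9], bias=-0.2)
    return neuron([h1, h2], [1.2, 1.1], bias=-1.0)

# The output is only treated as a positive signal if it crosses a threshold.
print("spam-like" if tiny_network([0.9, 0.7]) > 0.5 else "normal-like")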

3.2.1 WEB SPAM DETECTION

With the explosive growth of information on the web, search engines have become an important tool to help people find their desired information in daily life. Given a certain query, search engines can generally return thousands of pages, but most users read only the first few, so page ranking is highly important in search engines. Many people therefore employ means to deceive the ranking algorithm of search engines so that some web pages achieve undeservedly high ranking values, which can attract the attention of users and help obtain benefits. All the deceptive actions that try to increase the ranking of a page in search engines are generally referred to as Web Spam. Web Spam seriously deteriorates search engine ranking results, creates a great obstacle in users' information acquisition process and brings a poor user experience. From the point of view of a search engine, even if spam pages are not ranked sufficiently high to annoy users, there is a cost to crawl, index and store them. Detecting Web Spam has become one of the top challenges in the research of web search engines. According to the characteristics of the Web Spam dataset, this thesis focuses on constructing classifiers based on the features of web pages in order to improve Web Spam detection performance. It contains the following three parts: (1) a method to learn a discriminating function to detect Web Spam by Genetic Programming, where an individual is defined as a discriminating function for detecting Web Spam.

Figure 3.2: Classification of Web Spam Techniques

Genetic Programming could find the optimized discriminating function to


improve the Web Spam detection performance after genetic operators. However, one problem
occurs when Genetic Programming is employed to generate discriminating functions. It is
difficult to know the proper length of an individual because we have no prior knowledge
about optimal solutions. If the length of an individual is too short, the individual contains few
features and its discrimination is poor. The classification performance of the corresponding
functional expression is not good. If we want to make full use of the features in the Web Spam dataset, such as content features, link features and so on, the discriminating function needs to be longer and the scale of the corresponding individual larger. For a population composed of such large-scale individuals, construction and search require more time. Based on the principle that a long discriminating function is composed of several short ones, this thesis proposes a new method to learn a discriminating function to detect Web Spam
by Genetic Programming. This method first constructs multi-populations composed of some
small-scale individuals and every population can generate one best individual belonging to
the population by genetic operators. Then the best individuals in every population are
combined by Genetic Programming to gain a possible best discriminating function. This
method can generate a better discriminating function to detect Web Spam within less time.

Figure 3.3: Email Spam Detection

This work also studies the effect of the depth of the binary trees representing the individuals in the Genetic Programming evolution process and the efficiency of the combination. We perform experiments on WEBSPAM-UK2006. The experimental results show that: (1) the multi-population Genetic Programming with two combinations improves spam classification recall by 5.6%, F-measure by 2.25% and accuracy by 2.83% compared with single-population Genetic Programming; (2) the approach improves spam classification recall by 26%, F-measure by 11% and accuracy by 4% compared with SVM. The second part detects Web Spam with an ensemble learning algorithm based on Genetic Programming. At present, most Web Spam detection methods based on classification employ only one classification algorithm to create base classifiers, and they ignore the imbalance between spam and normal samples, i.e. normal samples greatly outnumber spam ones. Since there are many types of Web Spam techniques and new types of spam are being developed continually, it is impossible to expect to find an omnipotent classifier able to detect all kinds of Web Spam. Integrating the detection results of multiple classifiers is a way to obtain an enhanced classifier for Web Spam detection, and ensemble learning is also one of the effective methods for the classification problem on imbalanced datasets.

Figure 3.4: Web Spam Detection

Two key issues in ensemble learning are how to generate diverse base classifiers and
how to integrate their results. This paper proposes to detect Web Spam by ensemble learning
algorithm based on Genetic Programming. This new method first generates multiple diverse
base classifiers, which use different classification algorithms and are trained on different
instances and features. Then Genetic Programming is utilized to learn a novel classifier,
which gives the final detection result based on the detection results of base classifiers. This
method generates diverse base classifiers with different data sets and classification algorithms
according to the characteristics of the Web Spam dataset. Combining the results of the base classifiers with Genetic Programming makes it easy not only to integrate the classification results of heterogeneous base classifiers to improve classification performance, but also to select a subset of base classifiers for integration, which reduces prediction time. This approach also
combines the under-sampling technology with ensemble learning to improve the
classification performance on imbalanced datasets. In order to verify the effectiveness of the
Genetic Programming-based ensemble learning, we perform experiments on balanced and
imbalanced data sets respectively. The experiments on the balanced dataset first analyze the
effect of classification algorithms and feature sets on the ensemble. Then the experimental
results are compared with those of some known ensemble learning algorithms and the results
show that the new approach performs better than some known ensemble learning algorithms
in terms of precision, recall, F-measure, accuracy, Error Rate and AUC.
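A minimal sketch of the general idea of combining heterogeneous base classifiers on an under-sampled imbalanced dataset, assuming scikit-learn is available; majority voting stands in here for the Genetic Programming combiner used in the thesis, and the toy data is invented for illustration.

import random
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

random.seed(0)

# Toy imbalanced data: 90 normal pages, 10 spam pages, two numeric features.
normal = [[random.random(), random.random()] for _ in range(90)]
spam = [[random.random() + 1.0, random.random() + 1.0] for _ in range(10)]

# Under-sample the majority class so both classes are comparable in size.
normal_sampled = random.sample(normal, 10)
X = normal_sampled + spam
y = [0] * len(normal_sampled) + [1] * len(spam)

# Heterogeneous base classifiers combined by majority vote.
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression()),
    ("dt", DecisionTreeClassifier(max_depth=3)),
    ("nb", GaussianNB()),
], voting="hard")
ensemble.fit(X, y)

print(ensemble.predict([[1.5, 1.4], [0.2, 0.3]]))   # likely [1, 0]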
The experiments on the imbalanced dataset show that this method can improve the classification performance whether or not the base classifiers belong to the same type, and in most cases the heterogeneous classifier ensembles work better than the homogeneous ones. The F-measure of this new ensemble method is higher than those of AdaBoost, Bagging, Random Forest, Vote, the EDKC algorithm and the method based on prediction spamicity. The third part generates new features by Genetic Programming to detect Web Spam. For classification problems, features play an important role. In the publicly available WEBSPAM-UK2006 dataset, there are 96 content-based features, 41 link-based features and transformed link-based features.

The transformed link-based features are simple combinations or logarithmic transformations of the link-based features, produced manually, which must be accomplished by experts and is labour-intensive. In addition, it is not easy to combine different kinds of features, such as content features and link features. The proposed method derives new discriminating features from existing features using GP and uses these newly generated features as inputs to an SVM classifier and GP classifiers for Web Spam detection. Experiments on WEBSPAM-UK2006 show that the classification results of the classifiers that use the 10 new features are much better than those of the classifiers that use the original 41 link-based features and are equivalent to those of the classifiers that use the 138 transformed link-based features.

One form of term spamming includes specific terms in the document body, for example "Free grant money", "free installation", "Promise you ...!", "free preview", etc. Another way of grouping term spamming techniques is based on the type of terms that are added to the text fields: repeating one or a few specific terms, including a large number of unrelated terms, or stitching together sentences or phrases, possibly from different sources. Manually building the desired set of rules is problematic: not all users can build such a set, and it is a time-consuming process, since the generated set of rules must be changed or refined periodically as the nature of spam changes. Because of the problems associated with the manual construction of rules, another approach was proposed to automatically adapt to the changing nature of spam over time and to provide a system that can learn directly from the data already stored in the web server databases.
CHAPTER 4
SPAM DETECTION MESSAGE ANALYSIS

4.1 MESSAGE SPAM


The need for messaging lies in the necessity of information sharing among distributed
entities that are engaged in collecting, processing, producing and storing information on
heterogeneous platforms. Spam is seen in almost every area of information sharing, such as email, Voice over IP (VoIP), instant messaging, blogs, newsgroups, forums, Short Messaging Service (SMS), etc. Security research communities have attributed different names to spam based on their areas of activity. While email spam is widely known, spam is also called SPIT, SPIM, FSPAM and SPLOG in the areas of VoIP, IM, forums and blogs respectively.

Figure 4.1: High Level Taxonomy of Spam


In this dissertation, we mainly focus on detecting instant messaging spam (SPIM). Message spam consists of intentionally created texts that are indiscriminately sent without the consent of the recipient. Figure 4.1 presents one such spam taxonomy from the data communication perspective. We consider different message communication techniques, namely request-response, store-and-forward and near real-time.

We classify web spam as request-response type; e-mail spam and SMS spam are store-and-forward data communication, while SPIM and SPIT are near real-time. In the literature, a few articles also consider web spam as message spam, as it arrives as a result of an Internet user's request. It includes different techniques such as term spamming, link spamming, hiding, etc., as detailed by Gyöngyi and Garcia-Molina. Bookmark spam, comment spam, blog spam (SPLOG) and social network spam are also examples of web spam. These types of spamming are indirect and do not really require establishing sessions with victims. In contrast, the other two types of spam are exchanged directly between spam senders and receivers.

Most commonly, spam is sent through email by writing text, adding unsolicited attachments or putting in links to propaganda and malware. The unsolicited electronic content is often sent in bulk to multiple recipients. There are different techniques involved in email spamming, such as image attachments, blank emails, backscattering, etc. A good
number of methodologies have also been established to classify the email-spam. The other
type of store and forward spam is called Short Message Service (SMS)-spam that refers to the
propagation of unsolicited texts usually containing advertisements through short message
service.

The distinguishing characteristics of these spam messages include their small size and frequent use of non-dictionary words. SMS spam is not as widespread as its email counterpart: strict rules, restrictions, and monetary charges have been imposed worldwide on user connections to limit such spam propagation. Both of these store-and-forward spam types travel through intermediate servers that also keep a copy of the messages. Consequently, these messages (including spam) can be delivered to active as well as currently inactive users after their next activation, and this type of spam can be classified through reputation-based and content-based techniques applied to the storage before or after delivery. In contrast, instant messaging spam is generated through ubiquitous spamming techniques with high potential danger. SPIM can be delivered only to a registered "online" recipient through instant messaging applications.

The spam message comes through a chat window and therefore bypasses almost all security settings via a pre-installed messaging application on the user's device. The window pops up advertisements, links to viruses and spyware, etc. It is also able to deliver applications, such as Trojan horses, that are capable of installing themselves on the user's device. Recently, security experts in governments, corporations and ISPs have been continuously warning against SPIM because of its highly intrusive character, owing to its design to bypass local security settings. SPIT refers to similar unsolicited "spam" calls using Voice over Internet Protocol (VoIP). Spammers use automated calling applications (bots) for the purpose of telemarketing, prank calls and other abuses. As VoIP continues to replace conventional telephony with low-cost communication, spammers are increasingly targeting this platform to reach out to large groups of callers.
SPIT is delivered by exploiting pitfalls in the underlying protocol, namely: Session
Initiation Protocol. However, their detection is not easy due to real-time communication and
associated legal challenges concerning call privacy.
The impact of spamming is threefold. First, it affects the privacy and security of the
spam recipients. Secondly, it creates vulnerabilities in the whole network that is hosting these
user devices. Thirdly, it has a larger effect on the corporate resources and infrastructure since
a significant amount of corporate resource gets wasted to serve these unsolicited messages.
More recently, spam-sending bots have been seen attempting social engineering, gathering intelligence, mounting phishing attacks and spreading malware, thereby threatening the usability and security of collaborative communication platforms. The technical specification group of the 3rd Generation Partnership Project (3GPP) quantified that approximately 250 GB of SPIT traffic per month can be generated from only one SPIT bot. In the absence of effective filtering policies, SPIM is also seen generating significant potential revenue. Steve Roche reported in his book that SPIM accounts for about 5% of overall IM traffic; 70% of such messages carry links to pornographic websites, 12% contain "get rich" schemes and 9% promote product sales. Therefore, high academic and industrial interest is required to bridge this gap.

4.2 IMPLEMENTATION OF SPAM DETECTION

1) Account authentication

2) Sending mails

3) SPOT detection

i. capture IP

ii. SPOT filter

iii. SPOT results


4) CT detection.

5) PT detection

1. Account authentication

 This module checks the mail id and password.

 If these two fields are valid, the account is authenticated.

 Otherwise the account is not valid.

2. Sending mails

 In this module a single user can send one or more mails to other users.

 These mails are either spam or non-spam.

 Spam means that multiple copies of the same message are sent.

 A spam mail also contains more than 20 lines.

3. SPOT detection

 This module captures the IP address of the system.

 The mails from that system are passed to the filtering process.

 In this process, the mail content is filtered.

 Finally, the filter result is produced.

4. CT detection

 This module sets the threshold value Cs.

 Cs denotes the fixed length of a spam mail.

 The number of lines in each mail is counted.

 The line count of each mail is compared with the threshold value.

 Mails whose count is greater than or equal to the threshold are flagged as spam.


5. PT detection

 This module sets two threshold values.

 1) Ca specifies the minimum number of mails that a machine must send. 2) P specifies the maximum spam mail percentage of a normal machine.

 The algorithm computes the count of total mails and the count of spam mails of a machine.

 It checks whether the count of total mails is greater than or equal to Ca and the spam mail percentage is greater than or equal to P.

 If both conditions hold, the machine's mails are flagged as spam.

Parser - A parser which serves to import the email data. This tool is responsible for taking
any email data format and importing it into the EMT database.

Database - An underlying database to store the email messages. It describes the schema and
rationale in detail. This component is where the actual email data resides for analysis by the
models.

GUI - A front-end Graphical User Interface (GUI), which allows the data and models to be manipulated. It also allows the user to test a range of parameters for each model in the offline system, so that they can accurately judge the ideal parameters for a specific set of email data.
Figure 4.2: The EMT architecture, composed of an email parser, database back-end, set of models, and a GUI front-end.
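A minimal sketch of the parser and database components, using Python's standard mailbox and sqlite3 modules; the table layout below is an illustrative stand-in, not the actual EMT schema.

import mailbox
import sqlite3

def import_mbox(mbox_path, db_path="emt_demo.db"):
    """Parse an mbox file and store basic header fields in a SQLite table
    so that later analysis models can query the messages."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS email (
                        id INTEGER PRIMARY KEY,
                        sender TEXT, recipient TEXT,
                        subject TEXT, date TEXT)""")
    for msg in mailbox.mbox(mbox_path):
        conn.execute(
            "INSERT INTO email (sender, recipient, subject, date) VALUES (?,?,?,?)",
            (msg.get("From"), msg.get("To"), msg.get("Subject"), msg.get("Date")))
    conn.commit()
    conn.close()

# Example usage (the path is hypothetical):
# import_mbox("/var/mail/alice.mbox")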

Message Window

The Message window offers a particular view of the data in the database. A view is
composed of a set of constraints defined over the data. For example it can be all messages
associated with a particular folder or user.

The following features can all be used alone or in combination to constrain the data
view.

1. Date - All messages between a set of dates.

2. User, Direction - We can choose a specific user to view all their email, and also define
which direction (inbound, outbound, or both) we would like to view.

3. Label - We can view specific emails, such as spam or virus.

4. MySQL - A MySQL statement can be defined to specifically choose a subset of the data. This allows users to extend the schema and use those extensions within the SPAM framework. Views exist to allow the system to scale to arbitrarily large numbers of messages without taking over all of the system resources. The message window also allows old views to be revisited using back and forward buttons near the top of the GUI.

4.3 ATTACHMENT STATISTICS

Attachment statistics can be viewed, saved, or selected for further analysis. In


addition the user can group similar attachments either by similar name or similar content
(using Gabor Filtering with use of classification).

4.3.1 SETUP

Each of the component classifiers presented in this thesis and embedded in EMT produces a classification output as a score in a fixed range, with a high number indicating confidence in the prediction that an email is unwanted or spam. We refer to these outputs as the 'raw scores', which we combine through the various correlation functions. The training regime requires some explanation. A set of emails is first marked and labeled by the user, indicating whether they are spam or normal. This information can also be gleaned by observing user behavior (whether the user deletes a message prior to opening it, or moves it to a "garbage" or "spam" folder). Although sometimes the entire message is contained in the subject line (for example, "Meeting canceled!"), most users will on average also click to check whether there are further details in the message body. For our experimental results, users provided their email files with the messages they considered spam placed in a special folder. Those we labeled as spam, while all other messages we labeled as normal. The collections contained all messages received; deleted emails had been moved to a deleted folder, but not actually deleted.
This data set of real emails was also used to study the model combination methods. Our data set consists of emails collected from five users at Columbia University spanning from 1997 to 2005, a user with a Hotmail account, and a user with a Verizon.net email account. In total we collected 320,000 emails taking up about 2.5 gigabytes of space. Users indicated which emails were spam by moving them to specific folders.

Year    Number of emails

1997    85
1998    687
1999    2,355
2000    9,553
2001    25,780
2002    56,021
2003    145,291
2004    81,176

Table 4.1: Emails per calendar year for the main data set used in the experiments.

Because current spam levels on the Internet are estimated at 60%, we sampled the set
of emails so that we would have a 60% ratio of spam to normal over all our emails. We were
left with a corpus of 278,274 emails time-ordered as received by each user.
We tested the models using the 80/20 rule with 80% being the ratio of training to
testing. Hence, the first 80% of the ordered email are used to train the component classifiers
and the correlation functions, while the following 20% serve as the test data used to plot our
results. This set up mimics how such an automatic classification system would be used in
practice. As time marches on, newly received emails become training data used to update the
classifiers applied to new incoming data, and those data in turn serve as training for another
round of learning. Earlier tests used 5-fold cross validation without any statistically
significant difference from the single-split results, so we opted to keep it simple with the
single-split tests.
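
As a minimal sketch of this chronological 80/20 split (assuming the messages are already sorted by arrival time; the message representation is left abstract here, so this is an illustration rather than the exact experimental code):

import java.util.List;

public class ChronologicalSplit {
    /**
     * Returns the index that separates the first 80% (training data) from the
     * last 20% (test data) of a time-ordered list of messages, mirroring how
     * past mail trains the classifiers that are then applied to newer mail.
     */
    public static int trainingCutoff(List<?> orderedByArrival) {
        return (int) Math.round(orderedByArrival.size() * 0.8);
    }

    public static void main(String[] args) {
        List<String> mail = List.of("msg1", "msg2", "msg3", "msg4", "msg5");
        int cut = trainingCutoff(mail);
        System.out.println("train = " + mail.subList(0, cut));            // first 80%
        System.out.println("test  = " + mail.subList(cut, mail.size()));  // last 20%
    }
}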
The data used was pristine and unaltered. No preprocessing was done to the bodies of
the emails with the exception that all text was evaluated in lower case. Headers of the emails
were ignored except for subject lines that are used in some of the non-content based
classifiers. While adding header data would have improved individual classification, there is
much variability in what is seen in the header, and we felt it might over-train and learn some
subtle features of tokens only available in the header data present in the Columbia data set.
For some of the individual classifiers (Ngram, TF-IDF, PGram, and Text Classifier) we
truncated the email parts so that we only used the first 800 bytes of each part of the email or
attachment. This was done for efficiency and computational reasons, as there were many large
executable attachments in our dataset. In addition, detection increased by about 10% at the
same false positive rates compared with using full email bodies; the reason is the noise
contributed by the very large number of tokens seen in very large spam messages.
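
A rough sketch of that preprocessing step, assuming a message part is already available as a string (the helper name and the UTF-8 handling are illustrative, not the exact implementation):

import java.nio.charset.StandardCharsets;

public class PartPreprocessor {
    private static final int MAX_BYTES = 800;  // only the first 800 bytes of each part are kept

    /** Lowercases a message part and truncates it to the first MAX_BYTES bytes. */
    public static String prepare(String part) {
        byte[] bytes = part.toLowerCase().getBytes(StandardCharsets.UTF_8);
        int length = Math.min(bytes.length, MAX_BYTES);
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }
}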

4.4 EMAIL USED IN THE SPAM FRAMEWORK DATA SET ANALYSIS

Row 1 - Notation: R-DEV-R, Type: RB
  MetaPath: Review - Threshold Rate Deviation - Review
  Semantic: Reviews with the same Rate Deviation from the average Item rate (based on recursive minimal entropy partitioning)

Row 2 - Notation: R-U-NR-U-R, Type: UB
  MetaPath: Review - User - Negative Ratio - User - Review
  Semantic: Reviews written by different Users with the same Negative Ratio

Row 3 - Notation: R-ENF-R, Type: RB
  MetaPath: Review - Early Time Frame - Review
  Semantic: Reviews with the same released date related to the Item

Row 4 - Notation: R-U-BST-U-R, Type: UB
  MetaPath: Review - User - Burstiness - User - Review
  Semantic: Reviews written by different Users in the same Burst

Row 5 - Notation: R-RES-R, Type: RL
  MetaPath: Review - Ratio of Exclamation Sentences containing '!' - Review
  Semantic: Reviews with the same number of Exclamation Sentences containing '!'

Row 6 - Notation: R-PPI-R, Type: RL
  MetaPath: Review - First Person Pronoun - Review
  Semantic: Reviews with the same number of first person pronouns

Row 7 - Notation: R-U-ACS-U-R, Type: UL
  MetaPath: Review - User - Average Content Similarity - User - Review
  Semantic: Reviews written by different Users with the same Average Content Similarity (cosine similarity score)

Row 8 - Notation: R-U-MCS-U-R, Type: UL
  MetaPath: Review - User - Maximum Content Similarity - User - Review
  Semantic: Reviews written by different Users with the same Maximum Content Similarity (cosine similarity score)

Table 4.2: Email used in the spam framework data set analysis

4.5 REVIEW DATASETS USED IN THIS WORK

Dataset        Reviews (spam %)    Users      Businesses (restaurants & hotels)

Main           608,598 (13%)       260,277    5,044
Review-based   62,990 (13%)        48,121     3,278
Item-based     66,841 (34%)        52,453     4,588
User-based     183,963 (19%)       150,278    4,568

Table 4.3: Review datasets used in this work


Figure 4.3: Flow chart showing the classification of mails

4.7 SPOT DETECTION ALGORITHM


In the following we describe the SPOT detection algorithm. Algorithm 1 outlines the steps of the
algorithm, which are executed whenever an outgoing message arrives at the SPOT detection system.
To ease exposition of the algorithm, we ignore the potential impact of dynamic IP addresses and
assume that an IP address corresponds to a unique machine. We informally discuss the impact of
dynamic IP addresses at the end of this chapter, and we formally evaluate the performance of SPOT,
CT, and PT, as well as the potential impact of dynamic IP addresses, in a later chapter.

4.7.1 ALGORITHM IMPLEMENTATION

Step 1: An outgoing message arrives at SPOT
Step 2: Get the IP address of the sending machine m
Step 3: // all following parameters are specific to machine m
Step 4: Let n be the message index
Step 5: Let Xn = 1 if the message is spam, otherwise Xn = 0
Step 6: if (Xn == 1) then
Step 7:     // spam
Step 8:     Λn += ln(θ1 / θ0)
Step 9: else
Step 10:    // nonspam
Step 11:    Λn += ln((1 - θ1) / (1 - θ0))
Step 12: end if
Step 13: if (Λn >= B) then
Step 14:    Machine m is compromised. Test terminates for m.
Step 15: else if (Λn <= A) then
Step 16:    Machine m is normal. Test is reset for m:
Step 17:    Λn = 0
Step 18:    Test continues with new observations
Step 19: else
Step 20:    Test continues with additional observations
Step 21: end if
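
The following Java sketch restates the logic of Algorithm 1; it is an illustration, not the authors' implementation. It keeps one running log-likelihood ratio per IP address and uses Wald's approximate boundaries computed from the desired error rates α and β. The class and method names are invented for this example.

import java.util.HashMap;
import java.util.Map;

public class SpotDetector {
    private final double theta1;   // P(spam | compromised), e.g. 0.9
    private final double theta0;   // P(spam | normal), e.g. 0.2
    private final double upperB;   // accept H1 (compromised) when Lambda_n >= B
    private final double lowerA;   // accept H0 (normal) when Lambda_n <= A
    private final Map<String, Double> lambda = new HashMap<>(); // per-IP log ratio

    public SpotDetector(double alpha, double beta, double theta1, double theta0) {
        this.theta1 = theta1;
        this.theta0 = theta0;
        this.upperB = Math.log((1 - beta) / alpha);   // Wald's approximation
        this.lowerA = Math.log(beta / (1 - alpha));
    }

    /** Processes one outgoing message; returns true if the machine is flagged as compromised. */
    public boolean observe(String ip, boolean isSpam) {
        double ln = lambda.getOrDefault(ip, 0.0);
        ln += isSpam ? Math.log(theta1 / theta0)
                     : Math.log((1 - theta1) / (1 - theta0));
        if (ln >= upperB) {            // machine declared compromised
            lambda.remove(ip);         // no further monitoring needed
            return true;
        }
        if (ln <= lowerA) {            // machine declared normal: reset the test
            lambda.put(ip, 0.0);
            return false;
        }
        lambda.put(ip, ln);            // no decision yet; keep observing
        return false;
    }
}

With α = β = 0.01, θ1 = 0.9, and θ0 = 0.2, each spam observation adds ln(0.9/0.2) ≈ 1.5 to Λn against an upper boundary of ln(0.99/0.01) ≈ 4.6, which is why only a handful of spam messages are normally needed to flag a machine.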

4.7.2 SPOT ALGORITHM EXPLANATION

SPOT is designed based on the statistical tool SPRT. In SPOT, H1 is the hypothesis that a
machine is compromised and H0 the hypothesis that a machine is normal. In addition, let Xi = 1
if the i-th message from the concerned machine in the network is a spam, and Xi = 0 otherwise.
When an outgoing message arrives at the SPOT system, it records the IP address of the
message-sending machine. Then, using a content-based spam filter, the message is classified as
either ham or spam. SPOT maintains the logarithm value of the corresponding probability ratio
Λn for every observed IP address. When a machine is identified as being compromised, it is
added to the list of potentially compromised machines. Once the machine is declared as
compromised, it is not further monitored by SPOT. On the other hand, a machine that is
currently normal may get compromised at a later time. Therefore, normal machines are
continuously monitored by SPOT. Once such a machine is identified as normal by SPOT, the
records of the machine in SPOT are reset so that a new monitoring phase starts for the machine.

The sending machine's IP address is recorded, and the message is classified as either
spam or nonspam by the (content-based) spam filter. For each observed IP address, SPOT
maintains the logarithm value of the corresponding probability ratio Λn. Based on the relation
between Λn and the two thresholds A and B, the algorithm determines whether the
corresponding machine is compromised, is normal, or whether a decision cannot yet be reached.

The message-sending behavior of the machine is also recorded should further analysis
be required. Until the machine is cleaned and removed from the list, the SPOT detection
system does not need to further monitor its message-sending behavior.

On the other hand, a machine that is currently normal may get compromised at a later
time. Therefore, we need to continuously monitor machines that are determined to be normal
by SPOT. Once such a machine is identified by SPOT, the records of the machine in SPOT
are re-set; in particular, the value of Λn is set to zero, so that a new monitoring phase starts
for the machine.

SPOT requires four user-defined parameters: α, β, θ1, and θ0. In this section we discuss
how a user of SPOT configures these parameters, and how these parameters may affect the
performance of SPOT. The error rates α and β are normally small values in the range from
0.01 to 0.05, which users can easily specify independent of the behaviors of the compromised
and normal machines in the network.

Ideally, θ1 and θ0 should indicate the true probability of a message being spam from a
compromised machine and from a normal machine, respectively. However, as we have discussed
in the last chapter, θ1 and θ0 do not need to accurately model the behaviors of the two types
of machines. Instead, as long as the true distribution is closer to one of them than to the other,
SPRT can reach a conclusion with the desired error rates. Inaccurate values assigned to these
parameters will only affect the number of observations required by the algorithm to
terminate. Moreover, SPOT relies on a (content-based) spam filter to classify an outgoing
message into either spam or nonspam. In practice, θ1 and θ0 should model the detection rate
and the false positive rate of the employed spam filter, respectively. We note that all the
widely used spam filters have a high detection rate and a low false positive rate.
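
In Wald's standard approximation (a textbook SPRT result, restated here in this document's notation; an actual implementation may round the thresholds slightly differently), the two decision boundaries of Algorithm 1 follow directly from the desired false positive rate α and false negative rate β:

A \approx \ln\frac{\beta}{1-\alpha}, \qquad B \approx \ln\frac{1-\beta}{\alpha}

For example, with α = β = 0.01 these give A ≈ -4.6 and B ≈ +4.6, so the test declares a machine compromised once Λn rises above about 4.6 and declares it normal once Λn falls below about -4.6.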

To get some intuitive understanding of the average number of required observations
for SPRT to reach a decision, Figures 5.1 (a) and (b) show the value of E[N|H1] as a function
of θ0 and θ1, respectively, for different desired false positive rates. In the figures we set the
false negative rate β = 0.01. In Figure 5.1 (a) we assume the probability of a message being
spam when H1 is true to be 0.9 (θ1 = 0.9). That is, the corresponding spam filter is assumed to
have a 90% detection rate. From the figure we can see that it only takes a small number of
observations for SPRT to reach a decision. For example, when θ0 = 0.2 (the spam filter has a
20% false positive rate), SPRT requires about 3 observations to detect that the machine is
compromised if the desired false positive rate is 0.01. As the behavior of a normal machine
gets closer to that of a compromised machine (or rather, as the false positive rate of the spam
filter increases), i.e., as θ0 increases, a slightly higher number of observations is required for
SPRT to reach a detection.

In Figure 5.1 (b) we assume the probability of a message being spam from a normal
machine to be 0.2 (θ0 = 0.2). That is, the corresponding spam filter has a false positive rate of
20%. From the figure we can see that it also only takes a small number of observations for
SPRT to reach a decision. As the behavior of a compromised machine gets closer to that of a
normal machine (or rather, as the detection rate of the spam filter decreases), i.e., as θ1
decreases, a higher number of observations is required for SPRT to reach a detection.

From the figures we can also see that, as the desired false positive rate decreases,
SPRT needs a higher number of observations to reach a conclusion. The same observation
applies to the desired false negative rate. These observations illustrate the trade-offs between
the desired performance of SPRT and the cost of the algorithm. In the above discussion, we
only show the average number of required observations when H1 is true because we are more
interested in the speed of SPOT in detecting compromised machines. The study of E[N|H0]
shows a similar trend (not shown).
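
As a sketch of the quantity plotted in those figures, Wald's approximation for the expected number of observations under H1, written in the α, β, θ0, θ1 notation used above, is:

E[N \mid H_1] \approx \frac{(1-\beta)\,\ln\frac{1-\beta}{\alpha} \;+\; \beta\,\ln\frac{\beta}{1-\alpha}}{\theta_1\,\ln\frac{\theta_1}{\theta_0} \;+\; (1-\theta_1)\,\ln\frac{1-\theta_1}{1-\theta_0}}

Plugging in α = β = 0.01, θ1 = 0.9, and θ0 = 0.2 gives roughly 4.5 / 1.15, i.e., about four observations, which is of the same order as the small observation counts read off Figure 5.1.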

4.8 TESTING OF SPAM

System testing is the stage of implementation aimed at ensuring that the system works
accurately and efficiently before live operation commences. Testing is the process of
executing a program with the intent of finding an error. A good test case is one that has a
high probability of finding an error, and a successful test is one that uncovers a yet
undiscovered error.

Testing is vital to the success of the system. System testing makes a logical
assumption that if all parts of the system are correct, the goal will be successfully
achieved. The candidate system is subjected to a variety of tests: on-line response, volume,
stress, recovery, security, and usability tests. A series of tests is performed before the system
is ready for user acceptance testing. Any engineered product can be tested in one of the
following ways. Knowing the specified functions that a product has been designed to perform,
tests can be conducted to demonstrate that each function is fully operational. Knowing the
internal workings of a product, tests can be conducted to ensure that "all gears mesh", that is,
that the internal operation of the product performs according to the specification and all
internal components have been adequately exercised.

4.9 ALTERNATIVE DESIGNS

When we first undertook the project, we also considered two alternative designs for detecting
spam zombies, one based on the number of spam messages and the other on the percentage of
spam messages sent from a machine. For simplicity, we refer to them as the
count-threshold (CT) detection algorithm and the percentage-threshold (PT) detection
algorithm, respectively.
In CT, the time is partitioned into windows of fixed length T. A user-defined threshold
parameter C specifies the maximum number of spam messages that may originate from a
normal machine in any time window. The system monitors the number of spam messages n
originating from a machine in each window. If n > C, then the algorithm declares that the
machine has been compromised.
PT works in a similar fashion, except that it works on the spam percentage. Formally, let
N and n denote the total number of messages and the number of spam messages originating
from a machine m within a window T; then PT declares machine m as being compromised if
n/N > P, where P is the user-defined maximum spam percentage of a normal machine.
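
A minimal sketch of the two windowed rules, assuming each message arrives with a timestamp and a spam/nonspam verdict; the per-IP bookkeeping and the minimum-message guard for PT are illustrative simplifications rather than the exact design.

import java.util.HashMap;
import java.util.Map;

public class WindowedDetectors {
    private final long windowMillis;    // window length T
    private final int countThreshold;   // CT parameter C
    private final double pctThreshold;  // PT parameter P
    private final int minTotal;         // minimum messages before PT is applied
    // per-IP state: {windowStart, spamCount, totalCount}
    private final Map<String, long[]> state = new HashMap<>();

    public WindowedDetectors(long windowMillis, int c, double p, int minTotal) {
        this.windowMillis = windowMillis;
        this.countThreshold = c;
        this.pctThreshold = p;
        this.minTotal = minTotal;
    }

    /** Returns "CT", "PT", or null depending on which rule (if any) flags the machine. */
    public String observe(String ip, long arrivalMillis, boolean isSpam) {
        long[] s = state.computeIfAbsent(ip, k -> new long[]{arrivalMillis, 0, 0});
        if (arrivalMillis - s[0] >= windowMillis) {  // fixed window elapsed: start a new one
            s[0] = arrivalMillis;
            s[1] = 0;
            s[2] = 0;
        }
        if (isSpam) s[1]++;   // spam messages n in this window
        s[2]++;               // total messages N in this window
        if (s[1] > countThreshold) return "CT";                        // n > C
        if (s[2] >= minTotal && (double) s[1] / s[2] > pctThreshold) { // n/N > P
            return "PT";
        }
        return null;
    }
}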
In the following we briefly compare them with the SPOT system. The three algorithms
have similar time and space complexities. They all need to maintain a record for each
observed machine and update the record as messages arrive from the machine. However,
unlike SPOT, which can provide a bounded false positive rate and false negative rate, and
hence a confidence in how well SPOT works, the error rates of CT and PT cannot be specified
a priori. SPOT requires four user-defined parameters, α, β, θ1, and θ0. As we have discussed
in the previous sections, selecting values for these four parameters is relatively
straightforward. In contrast, selecting the "right" values for the parameters of CT and PT is
much more challenging and tricky. They require a thorough understanding of the different
behaviors of the compromised and normal machines in the concerned network, and training
based on the history of the two different behaviors, in order for them to work reasonably well
in the network. Our preliminary studies of the two alternative designs confirm that, unlike
SPOT, the performance of the two alternative algorithms is sensitive to the parameters used
in the algorithm. They may have either higher false positive or higher false negative rates.

4.10 IMPACT OF DYNAMIC IP ADDRESSES

In the above discussion of the SPOT algorithm we have, for simplicity, ignored the
potential impact of dynamic IP addresses and assumed that an observed IP corresponds to a
unique machine. This need not be the case for the algorithm to work correctly. SPOT can
work extremely well in an environment of dynamic IP addresses. To understand the reason,
we note that SPOT can reach a decision with a small number of observations, as illustrated by
the analysis of the average number of observations required for SPRT to terminate. In
practice, we have noted that 3 or 4 observations are sufficient for SPRT to reach a decision in
the vast majority of cases. If a machine is compromised, it is likely that more than 3 or 4 spam
messages will be sent before the (unwitting) user shuts down the machine. Therefore, dynamic
IP addresses will not have any significant impact on SPOT.
By contrast, CT and PT need to deal with dynamic IP addresses very carefully. We have
introduced that both CT and PT need two parameters. For CT, we define a time window and
a maximum number of spam messages; for PT, we define a time window and a maximum
percentage of spam messages. Let us discuss the time window first. Ideally, the length of the
time window would equal the duration of one machine's life time (the period during which it
holds the IP address), but it is impossible for a fixed-length time window to fit the different
life times of different machines at the same time. If the length of the time window is shorter
than the duration of one machine's life time, the machine's life time will be split across
multiple windows.
There could then be two cases. In the first case, the last window is only occupied by the
final part of this machine's life time; this gives CT or PT a chance to count correctly. In
the second case, the last window might be shared by this machine and one or more other
machines, which must lead to a wrong result. If the length of the time window is longer
than the duration of one machine's life time, the situation is similar to the second case of the
discussion above for shorter time windows, so it is again possible to get a wrong result. For
CT, we also need to set the maximum number of spam messages C, which is the counting
threshold. If CT counts more than C spam messages in a time window, it declares a zombie.
But if more than one machine shares one time window, CT might mistakenly count spam
messages from different machines together. The same mistake might happen when PT counts
messages. Another factor affecting the performance of CT and PT is that, when they group
messages into fixed-length time windows, they might not get enough spam messages in each
interval even when the total number of spam messages is clearly large enough.
CHAPTER 5
EMAIL TRACE AND METHODOLOGY
The mail relay server ran Spam to detect spam messages. The email trace contains the
following information for each incoming message: the local arrival time, the IP address of the
sending machine, and whether or not the message is spam. In addition, if a message has a
known virus/worm attachment, it was so indicated in the trace by anti-virus software. The
anti-virus software and Spam were two independent components deployed on the mail relay
server. Due to privacy issues, we do not have access to the content of the messages in the
trace.

Ideally, Spam should have collected all the outgoing messages in order to evaluate the
performance of SPOT. However, due to logistical constraints, we were not able to collect all
such messages. Instead, we identified the messages in the email trace that have been
forwarded or originated by the SPAM internal machines, that is, the messages forwarded or
originated by a SPAM internal machine and destined to a SPAM account. We refer to this
set of messages as the SPAM emails and perform our evaluation of SPOT based on them. We
note that the set of SPAM emails does not contain all the outgoing messages originated from
inside the SPAM network.

Measure                      Non-spam      Spam          Aggregate

Period                       8/25/2017 - 10/24/2018 (excluding 9/11/2017)
# of emails                  6,712,392     18,537,364    25,249,756
# of SPAM emails             5,612,245     6,959,737     12,571,982
# of infected emails         60,004        163,222       223,226
# of infected SPAM emails    34,345        43,687        78,032

Table 5.1: Summary of the email trace

                     Non-spam only    Spam only         Mixed

# of IP (%)          121,103 (4.9)    2,224,754 (90.4)  115,257 (4.7)
# of SPAM IP (%)     175 (39.7)       74 (16.8)         191 (43.5)

Table 5.2: Classification of the observed IP addresses


An email message in the trace is classified as either spam or non-spam by the Spam
filter deployed on the SPAM mail relay server. For ease of exposition, we refer to the set of
all messages, including both spam and non-spam, as the aggregate emails. If a message has a
known virus/worm attachment, we refer to it as an infected message. We refer
to an IP address of a sending machine as a spam-only IP address if only spam messages are
received from it. Similarly, we refer to an IP address as non-spam only or mixed if we
only receive non-spam messages from it, or if we receive both spam and non-spam messages
from it, respectively. Table 5.1 shows a summary of the email trace. As shown in the table,
the trace contains more than 25 M emails, of which more than 18 M, or about 73%, are spam.
During the course of the trace collection, we observed more than 2 M IP addresses (2,461,114)
of sending machines, of which more than 95% sent at least one spam message. During the
same period, we observed 440 SPAM internal IP addresses. Table 5.2 shows the classification
of the observed IP addresses. More detailed analyses of the email trace, including the daily
message arrival patterns and the behaviors of spammers at both the mail-server level and the
network level, can be found elsewhere.

                     Non-spam only    Spam only     Mixed          Aggregate

# of IP (%)          1,032 (9.9)      6,705 (64.6)  2,648 (25.5)   10,385 (100)
# of SPAM IP (%)     19 (9.3)         42 (20.6)     143 (70.1)     204 (100)

Table 5.3: Relation between virus-sending IP addresses and spam-sending IP addresses

First, only 9.9% of the total virus-sending IP addresses are non-spam-only IP addresses
that never send spam messages. A similar percentage (9.3%) is observed for the SPAM IP
addresses. Second, the largest fraction of the total virus-sending IP addresses consists of
spam-only IP addresses. This indicates the correlation between virus-sending and spam-sending
IP addresses. Third, this trend does not hold when we analyze the SPAM IP addresses: the
largest fraction of the SPAM IP addresses consists of mixed IP addresses. The reason is that
there are many mail relay servers in the SPAM network; they merge messages from clients and
hide the real senders. We verified that most of the mixed IP addresses in SPAM are these
servers. In addition, we can observe that only a small part of the IP addresses that send spam
messages also take part in sending viruses. Among all IP addresses, only 4.2% (10,385 out of
2,461,114) send viruses. By contrast, for the IP addresses in SPAM, the percentage sending
viruses, 46.4% (204 out of 440), is much higher than that for the total IP addresses.
Step 2: Find the email header
The header contains information about the routing of the email and the IP address. Most
email programs, like Outlook, Hotmail, Google Mail (Gmail), Yahoo, and AOL, hide the
header information because they see it as non-essential. If you know how to open the header,
you can still find this data.

 On Outlook, go to your inbox and highlight your email using your cursor, but do not
open it into its own window. If you are using a mouse, right click the message. If you
are using a Mac Operating System (OS) without a mouse, click while holding down
the "control" button. Select "Message Options" when the menu appears. Find the
headers at the bottom of the window that will appear.
 On Hotmail, click on the drop down menu next to the word "Reply." Select "View
Message Source." A window will pop up with the address information.
 On Gmail, click on the drop down menu next to the word "Reply" in the upper right
hand corner of your message. Select "Show Original." A window with the IP
information will pop up.
 On Yahoo, right click or press "control" and click when you are on the message.
Choose "View Full Headers."
 On AOL, click "Action" on your message, and then select "View Message Source."

Step 3: Identify the IP address in the information you have just uncovered. Following any of
these methods for your chosen email carrier, you will have a window that pops up with a lot
of code information. You will not need all of this information. If the window is too small to
effectively pick out the IP address, copy the information and paste it into a word processing
document.

Step 4: Look for the words "X-Originating-IP." This is the easiest way to spot the IP
address; however, it may not be listed in those terms on all email programs. If you cannot
find this term look for the word "Received" and follow the line until you see a numerical
address.

 Use the "Find" function on your computer to easily spot these terms. Click
"Command" and the letter "F" on Mac OS. In Internet Explorer click the "Edit" menu.
Select "Find on this Page," then type the word into the box that appears and click
"Enter."

Ensuring spam box avoidance

 There is no point sending an email if it is going to end up unseen in the recipient's
spam box. One way to prevent this is to ensure that you choose an email tracking
solution that has been placed on ISP white lists. Another way is to check that your
email tracker is working within spam-compliance parameters; for example, an email
should include an opt-out option in the form of an unsubscribe link. Tracking reports
typically use the following delivery statuses:
o Delivered - The message was successfully delivered to the intended destination.
o Failed - The message was not delivered; either delivery was attempted and failed, or
the message was not delivered as a result of actions taken by the filtering service (for
example, if the message was determined to contain malware).
o Pending - Delivery of the message is being attempted or re-attempted.
o Expanded - The message was sent to a distribution list and was expanded so the
members of the list can be viewed individually.
o Unknown - The message delivery status is unknown at this time. When the results of
the query are listed, the delivery details fields will not contain any information.

Fig : Tracker Identified

Step 5: Sender - You can narrow the search to specific senders by clicking the Add sender
button next to the Sender field. In the subsequent dialog box, select one or more
senders from your company from the user picker list and then click Add. To add
senders who aren't on the list, type their email addresses and click Check names. In
this box, wildcards are supported for email addresses in the format *@contoso.com.
When specifying a wildcard, other addresses can't be used. When you're done with
your selections, click OK.

Step 6: Recipient - You can narrow the search to specific recipients by clicking the
Add recipient button next to the Recipient field. In the subsequent dialog box, select
one or more recipients from your company from the user picker list and then click
Add. To add recipients who aren't on the list, type their email addresses and click
Check names. In this box, wildcards are supported for email addresses in the format
*@contoso.com. When specifying a wildcard, other addresses can't be used. When
you're done with your selections, click OK.
CHAPTER 6

EXPERIMENTAL RESULT

6.1 MATLAB

MATLAB is a high-level technical computing language and interactive
environment for algorithm development, data visualization, data analysis, and numerical
computation. Using MATLAB, you can solve technical computing problems faster than with
traditional programming languages, such as C, C++, and Fortran. MATLAB is a data analysis
and visualization tool which has been designed with powerful support for matrices and matrix
operations. As well as this, MATLAB has excellent graphics capabilities and its own powerful
programming language. One of the reasons that MATLAB has become such an important tool
is the use of sets of MATLAB programs designed to support a particular task. These sets of
programs are called toolboxes, and the particular toolbox of interest to us is the Image
Processing Toolbox. Rather than give a description of all of MATLAB's capabilities, we shall
restrict ourselves to just those aspects concerned with the handling of images. We shall
introduce functions, commands, and techniques as required. A MATLAB function is a
keyword which accepts various parameters and produces some sort of output: for example a
matrix, a string, or a graph. Examples of such functions are sin, imread, and close. There are
many functions in MATLAB, and as we shall see, it is very easy (and sometimes necessary) to
write our own.

MATLAB's standard data type is the matrix: all data are considered to be matrices of some
sort. Images, of course, are matrices whose elements are the grey values (or possibly the RGB
values) of their pixels. Single values are considered by MATLAB to be matrices, while a
string is merely a matrix of characters, its width being the string's length. In this chapter we
will look at the more generic MATLAB commands and discuss images in further chapters.
When you start up MATLAB, you have a blank window called the Command Window, in
which you enter commands. Given the vast number of MATLAB's functions and the different
parameters they can take, a command-line style interface is in fact much more efficient than a
complex sequence of pull-down menus.

You can use MATLAB in a wide range of applications, including signal and image
processing, communications, control design, test and measurement, and financial modeling
and analysis. Add-on toolboxes (collections of special-purpose MATLAB functions) extend
the MATLAB environment to solve particular classes of problems in these application areas.

MATLAB provides a number of features for documenting and sharing your work.
You can integrate your MATLAB code with other languages and applications, and distribute
your MATLAB algorithms and applications. When working with images in MATLAB, there
are many things to keep in mind, such as loading an image, using the right format, saving the
data as different data types, displaying an image, and converting between different image formats.

Image Processing Toolbox provides a comprehensive set of reference-standard


algorithms and graphical tools for image processing, analysis, visualization, and algorithm
development. You can perform image enhancement, image deblurring, feature detection, noise
reduction, image segmentation, spatial transformations, and image registration. Many
functions in the toolbox are multithreaded to take advantage of multicore and multiprocessor
computers.

o High-level language for numerical computation, visualization, and application
development
o Interactive environment for iterative exploration, design, and problem solving
o Mathematical functions for linear algebra, statistics, Fourier analysis, filtering,
optimization, numerical integration, and solving ordinary differential equations
o Built-in graphics for visualizing data and tools for creating custom plots
o Development tools for improving code quality and maintainability and
maximizing performance
o Tools for building applications with custom graphical interfaces
o Functions for integrating MATLAB-based algorithms with external
applications and languages such as C, Java, .NET, and Microsoft Excel.

Debugging helps to correct two kinds of errors:

Syntax errors - for example, omitting a parenthesis or misspelling a function name.

Run-time errors - run-time errors are usually apparent from their unexpected results but are
more difficult to track down.


6.2 DEBUGGING PROCESS

We can debug the M file using the Editor/Debugger as well as using debugging functions
from the Command Window. The debugging process consists of

 Preparing for debugging


 Setting breakpoints
 Running an M file with breakpoints
 Stepping through an M file
 Examining values
 Correcting problems
 Ending debugging

Setting breakpoints

Set breakpoints to pause execution of the function, so we can examine where the problem
might be. There are three basic types of breakpoints:

 A standard breakpoint, which stops at a specified line.

 A conditional breakpoint, which stops at a specified line and under specified
conditions.
 An error breakpoint, which stops when it produces the specified type of warning,
error, or NaN or infinite value.

JAVA PLATFORM:

A platform is the hardware or software environment in which a program runs. The


most popular platforms are Microsoft Windows, Linux, Solaris OS and MacOS. Most
platforms can be described as a combination of the operating system and underlying
hardware. The Java platform differs from most other platforms in that it is a software-only
platform that runs on top of other hardware-based platforms.

The java platform has two components:

 The Java Virtual Machine.

 The Java Application Programming Interface(API)

The Java Virtual Machine is the base for the Java platform and is ported onto various
hardware-based platforms.
The API is a large collection of ready-made software components that provide many
useful capabilities, such as graphical user interface (GUI) widgets. It is grouped into libraries
of related classes and interfaces; these libraries are known as packages.

As a platform-independent environment, the Java platform can be a bit slower than


native code. However, advances in compiler and virtual machine technologies are bringing
performance close to that of native code without threatening portability.

Development Tools:

The development tools provide everything you’ll need for compiling, running,
monitoring, debugging, and documenting your applications. As a new developer, the main
tools you’ll be using are the Java compiler (javac), the Java launcher (java), and the Java
documentation (javadoc).

Application Programming Interface (API):

The API provides the core functionality of the Java programming language. It offers a
wide array of useful classes ready for use in your own applications. It spans everything from
basic objects, to networking and security.

Deployment Technologies:

The JDK provides standard mechanisms such as Java Web Start and Java Plug-In, for
deploying your applications to end users.

User Interface Toolkits:

The Swing and Java 2D toolkits make it possible to create sophisticated Graphical
User Interfaces (GUIs).

Drag-and-drop support:

Drag-and-drop is one of the seemingly most difficult features to implement in user
interface development. It provides a high level of usability and intuitiveness.
Drag-and-drop is, as its name implies, a two-step operation: code must be written to
facilitate dragging and code to facilitate dropping. Sun provides two classes to help with this,
namely DragSource and DropTarget.

Look and Feel Support:

Swing defines an abstract LookAndFeel class that represents all the information central
to a look-and-feel implementation, such as its name, its description, whether it is a native
look-and-feel, and, in particular, a hash table (known as the "Defaults Table") for storing
default values for various look-and-feel attributes, such as colors and fonts.

Each look-and-feel implementation defines a subclass of LookAndFeel (for example,
swing.plaf.motif.MotifLookAndFeel) to provide Swing with the necessary information to
manage the look-and-feel.

The UIManager is the API through which components and programs access look-and-
feel information (they should rarely, if ever, talk directly to a LookAndFeel instance).
UIManager is responsible for keeping track of which LookAndFeel classes are available,
which are installed, and which is currently the default. The UIManager also manages access
to the Defaults Table for the current look-and-feel.
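
For instance, a Swing program can install and inspect a look-and-feel through UIManager; the small sketch below uses only standard Swing calls and the cross-platform (Metal) look-and-feel.

import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.SwingUtilities;
import javax.swing.UIManager;

public class LookAndFeelDemo {
    public static void main(String[] args) throws Exception {
        // Ask UIManager for the cross-platform look-and-feel and install it.
        UIManager.setLookAndFeel(UIManager.getCrossPlatformLookAndFeelClassName());

        JFrame frame = new JFrame("Look and feel: " + UIManager.getLookAndFeel().getName());
        // Read one entry from the Defaults Table managed by UIManager.
        frame.add(new JLabel("Defaults Table entry Label.font = "
                + UIManager.getDefaults().get("Label.font")));
        frame.pack();
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);

        // If the look-and-feel is changed later, existing components are refreshed with:
        SwingUtilities.updateComponentTreeUI(frame);
    }
}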
6.3 PERFORMANCE EVALUATION

We evaluate the performance of SPOT based on the collected SPAM emails. In all the
studies, we set α = 0.01, β = 0.01, θ1 = 0.9, and θ0 = 0.2; that is, the deployed spam filter has a
90% detection rate and a 20% false positive rate. Many widely-deployed spam filters have
much better performance than what we assume here. SPOT identified 132 of the observed
SPAM internal IP addresses as being associated with compromised machines. In order to
understand the performance of SPOT in terms of the false positive and false negative rates,
we rely on a number of ways to verify whether a machine is indeed compromised. First, we
check if any message sent from an IP address carries a known virus/worm attachment. If this
is the case, we say we have a confirmation. Out of the 132 IP addresses identified by SPOT,
we can confirm 110 of them to be compromised in this way. For the remaining 22 IP
addresses, we manually examine the spam-sending patterns from the IP addresses and the
domain names of the corresponding machines. If the fraction of spam messages from an IP
address is high (greater than 98%), we also claim that the corresponding machine has been
confirmed to be compromised. We can confirm 16 of them to be compromised in this way.
We note that the majority (62.5%) of the IP addresses confirmed by the spam percentage are
dynamic IP addresses, which further indicates the likelihood of the machines being compromised.

For the remaining 6 IP addresses that we cannot confirm by either of the above
means, we have also manually examined their sending patterns. We note that they have a
relatively low overall percentage of spam messages over the two months of the collection
period. However, they sent substantially more spam messages towards the end of the
collection period, which indicates that they may have gotten compromised towards the end of
our collection period. However, we cannot independently confirm whether this is the case.

Evaluating the false negative rate of SPOT is a bit tricky, because SPOT focuses
on the machines that are potentially compromised rather than on the machines that are normal
(see Chapter 5). In order to gain some intuitive understanding of the false negative rate of the
SPOT system, we consider the machines that SPOT does not identify as being compromised
at the end of the email collection period, but for which SPOT has re-set the records (lines 15
to 18 in Algorithm 1). That is, such machines have been claimed as being normal by SPOT
(but have continuously been monitored). We also obtain the list of IP addresses that have sent
at least one message with a virus/worm attachment. Seven of these IP addresses have been
claimed as being normal, i.e., missed, by SPOT.
The infected messages are only used to confirm whether a machine is compromised, in order
to study the performance of SPOT; infected messages are not used by SPOT itself. SPOT
relies on the spam messages instead of the infected messages to detect whether a machine has
been compromised to produce the results in Table 6.4. We make this decision by noting that
it is against the interest of a professional spammer to send spam messages carrying
virus/worm attachments: such messages are more likely to be detected by anti-virus software,
and hence deleted before reaching the intended recipients. This is confirmed by the low
percentage of infected messages in the overall email trace shown in Table 5.1. Infected
messages are more likely to be observed during the spam zombie recruitment phase rather
than the spamming phase. Infected messages can, however, be easily incorporated into the
SPOT system to improve its performance.

The actual false positive rate and the false negative rate are higher than the specified false
positive rate and false negative rate, respectively. One possible reason is that the evaluation
was based on the SPAM emails, which can only provide a partial view of the outgoing
messages originated from inside SPAM.

We also examined the number of actual observations that SPOT takes to detect the
compromised machines. The vast majority of compromised machines can be detected with a
small number of observations: for example, more than 80% of the compromised machines are
detected by SPOT with only 3 observations, and all the compromised machines are detected
with no more than 11 observations. This indicates that SPOT can quickly detect the
compromised machines. We note that SPOT does not need compromised machines to send
spam messages at a high rate in order to detect them. Here, "quick" detection does not mean a
short duration, but rather a small number of observations. A compromised machine can send
spam messages at a low rate (which, though, works against the interest of spammers), but it
can still be detected once enough observations are obtained by SPOT.
6.4 PERFORMANCE EVALUATION DESIGNS

We also evaluate the two alternative designs, one based on the number of spam messages
(CT) and the other on the percentage of spam messages (PT) sent from a machine. In this
section, we evaluate the performance of CT and PT for given user-defined parameters.

Figure 6.1: Count Threshold (CT)

Both CT and PT need a fixed time window T and an appropriate user-defined threshold.
In this evaluation, we set T to 1 hour. For CT, we set the threshold to 30 messages, which
means a machine will be considered a zombie if it sends more than 30 spam messages in
any one-hour window; for PT, we set the threshold to 50%, which means a machine will be
considered a zombie if more than 50% of the messages it sends in any one-hour window are
spam. Since SPOT needs at least 3 spam messages (when α = 0.01, β = 0.01, θ0 = 0.2, and
θ1 = 0.9) to detect a zombie, to compare PT with SPOT we require that at least 6 messages
have been sent from a given IP address, so we ignore all IP addresses that send fewer than 6
messages.

Figure 6.2: PT Detection

Algorithm   Total # SPAM IP   Detected   Confirmed (%)   Missed (%)

CT          440               81         79 (59.8)       53 (40.2)
PT          440               84         83 (61.9)       51 (38.1)

Table 6.1: Performance of CT and PT

The results are shown in Table 6.1. CT detects 81 zombies out of the 440 total SPAM IP
addresses, which is 61.36% of what SPOT has detected, and PT detects 84 zombies out of the
440 total SPAM IP addresses, which is 63.63% of what SPOT has detected. Although CT and
PT have a high accuracy (for CT, 79 out of 81 machines have been confirmed to be zombies;
for PT, 83 out of 84 machines have been confirmed to be zombies), if we consider the number
of zombies they missed (53 for CT and 51 for PT, again using machines that send viruses for
verification), CT only detects 59.8% of the verified zombies and PT only detects 61.9% of the
verified zombies. We also observed that all the zombies detected by CT or PT have been
detected by SPOT, and that all the confirmed zombies detected by CT (79) and PT (83) fall
into the set of confirmed zombies detected by SPOT (126). This shows that SPOT has more
detection power than CT and PT.

Figure 6.3: Threshold Calculation

6.5 DYNAMIC IP ADDRESSES

In order to understand the potential impact of dynamic IP addresses on the performance
of SPOT, CT, and PT, we group messages from a dynamic IP address (with domain names
containing "wireless") into clusters with a time-interval threshold of 30 minutes. Messages
with a consecutive inter-arrival time no greater than 30 minutes are grouped into the same
cluster. Given the short inter-arrival duration of messages within a cluster, we consider all the
messages from the same IP address within each cluster as being sent from the same machine;
that is, the corresponding IP address has not been re-assigned to a different machine within
the concerned cluster. (It is possible that messages from multiple adjacent clusters are
actually sent from the same machine.)
The cumulative distribution function (CDF) of the number of spam messages in each
cluster shows that more than 90% of the clusters have no less than 10 spam messages, and
more than 96% have no less than 3 spam messages. Given the large number of spam messages
sent within each cluster, it is unlikely for SPOT to mistake one compromised machine for
another when it tries to detect spam zombies. Indeed, we have manually checked that spam
messages tend to be sent back to back in a batch fashion when a dynamic IP address is
observed in the trace. Figure 6.4 shows the CDF of the number of all messages (including
both spam and non-spam) in each cluster; similar observations can be made.

Figure 6.4 :SPOT Implementation of SPAM Message

The CDF of the durations of the clusters shows that more than 75% and 58% of the
clusters last no less than 30 minutes and one hour (corresponding to the two vertical lines in
the figure), respectively. The longest duration of a cluster we observe in the trace is about 3.5
hours.
Given the above observations, in particular the large number of spam messages in each
cluster, we conclude that dynamic IP addresses will not have any important impact on the
performance of SPOT: SPOT can reach a decision within the vast majority (96%) of the
clusters. In addition, SPOT need not be deployed at a single mail relay server of the network.
In practice, a network may have multiple sub-domains, each with its own mail servers, and a
message may be forwarded by a number of mail relay servers before leaving the network.
SPOT can work well in this kind of network environment. In the following we outline two
possible approaches. First, SPOT can be deployed at the mail servers in each sub-domain to
monitor the outgoing messages so as to detect the compromised machines in that sub-domain.
Second, and possibly more practically, SPOT is only deployed at the designated mail servers,
which forward all outgoing messages (or SPOT gets a replicated stream of all outgoing
messages), as discussed in Chapter 3. SPOT relies on the Received header fields to identify
the originating machine of a message in the network. Given that the Received header fields
can be spoofed by spammers, SPOT should only use the Received header fields inserted by
the known mail servers in the network.

SPOT can determine the reliable Received header fields by backtracking from the last
known mail server in the network that forwards the message. It terminates and identifies the
originating machine when an IP address in the Received header field is not associated with a
known mail server in the network.
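
A minimal sketch of that backtracking idea is shown below; the set of known mail-server IP addresses and the header parsing are simplifying assumptions for illustration, not the deployed implementation.

import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReceivedBacktracker {
    private static final Pattern IP =
            Pattern.compile("(\\d{1,3}(?:\\.\\d{1,3}){3})");

    /**
     * Walks the Received headers from the most recent (inserted by the last known
     * internal mail server) backwards, and returns the first IP address that is NOT
     * a known mail server; that address is taken as the originating machine.
     */
    public static String originatingIp(List<String> receivedHeadersNewestFirst,
                                       Set<String> knownMailServerIps) {
        for (String header : receivedHeadersNewestFirst) {
            Matcher m = IP.matcher(header);
            if (!m.find()) continue;           // skip headers without an IP address
            String ip = m.group(1);
            if (!knownMailServerIps.contains(ip)) {
                return ip;                      // first hop outside the trusted servers
            }
        }
        return null;                            // origin could not be identified
    }
}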

Figure 6.5 : SPOT Implementation of SPAM Message


Given that SPOT relies on (content-based) spam filters to classify messages into spam
and non-spam, spammers may try to evade the developed SPOT system by evading the
deployed spam filters. They may send completely meaningless non-spam messages (as
classified by spam filters). However, this will reduce the real spamming rate, and hence the
financial gains, of the spammers. More importantly, as shown in Figure 5.1 (b), even if a
spammer reduces the spam percentage to 50%, SPOT can still detect the spam zombie with a
relatively small number of observations (25 when α = 0.01, β = 0.01, and θ0 = 0.2). So, trying
to send non-spam messages will not help spammers evade the SPOT system. Moreover, in
certain environments where user feedback is reliable, for example feedback from users of the
same network in which SPOT is deployed, SPOT can rely on classifications from end users
(in addition to the spam filter). Although completely meaningless messages may evade the
deployed spam filter, it is unlikely that they remain undetected by the end users who receive
such messages. User feedback may be incorporated into SPOT to improve the spam
detection rate of the spam filter. As we have discussed, trying to send spam at a low rate will
also not evade the SPOT system: SPOT relies on the number of (spam) messages, not the
sending rate, to detect spam zombies.

By contrast, the sending rate strongly affects detection for CT and PT. If spammers
reduce the number or percentage of spam messages in a time window, the user-defined
threshold can be evaded. For example, we set 30 messages as the threshold in CT; once this
information has been figured out by spammers, they are able to defeat it by sending fewer
than 30 spam messages in any time window. Moreover, if they know the size of a time
window, they could send even more spam messages by straddling two windows' boundary,
i.e., send 29 messages at the end of the first window and then another 29 at the beginning of
the next window. A similar scheme can be used to attack PT.
CHAPTER 7

CONCLUSION

The proposed system aims to delete emails with attached viruses and to detect and block
spammers by using the SPOT detection algorithm. An account-reactivation test is provided
by the system. The flow of work in the proposed system is as follows: when the system
receives an email message, it first checks the attachment for viruses and deletes any email
having virus files in an attachment. If no virus is found, the message is checked for spam, and
the system applies the SPOT detection algorithm to detect spammers.
The system maintains its database on the machine where it runs. To detect spam
messages, a set of spam patterns is used. To find emails carrying virus files, a dataset with
unique patterns is used; these unique patterns are file extensions. All incoming email
messages are scanned against this dataset.

Compromised machines are a major security threat on the Internet. Given that
spamming provides the critical economic incentive for attackers to recruit a large number
of compromised machines, in this thesis we developed SPOT, an effective spam zombie
detection system that monitors outgoing messages in a network. SPOT was designed based
on a simple and powerful statistical tool named the Sequential Probability Ratio Test to detect
the subset of compromised machines that are involved in spamming activities. SPOT has
bounded false positive and false negative error rates, and it also minimizes the number of
required observations to detect a spam zombie. Our evaluation studies, based on a two-month
email trace collected on the FSU campus network, showed that SPOT is an effective and
efficient system for automatically detecting compromised machines in a network. We have
also evaluated two alternative designs based on the spam count and the spam fraction; the
results show that SPOT outperforms them in both the number of detections and detection
accuracy. In summary, they are not as effective as SPOT.
CHAPTER 8

FUTURE WORK

 Implementing a collaborative system where multiple mail recipients contribute
information about the network-level characteristics (as well as content-based
features), which would allow all collaborators to improve their spam classifiers
and to quickly identify spammers that may be “new” from the perspective of just a
single domain.
 Spammers and phishers also pose a serious threat to emerging technologies such as
voice over IP and video over IP. Techniques should also be improved to address these
threats: it is possible to analyze the contents of e-mails before they are delivered to the
recipient, but VoIP does not allow this, which poses a challenge to its mitigation.
Various techniques based on trust and reputation could be explored for this purpose.
