
A Comparison of Feature-Based and Feature-Free Case-Based Reasoning for Spam Filtering

Derek Bridge, University College Cork

Sarah Jane Delany, Dublin Institute of Technology

Overview
- Introduction
- Case-Based Spam Filtering: Feature-Based; Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions

Introduction

From the Spamhaus project (www.spamhaus.org)


An electronic message is spam IF:
1) the recipient's personal identity and context are irrelevant because the message is equally applicable to many other potential recipients; AND
2) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for it to be sent.

We focus on email spam

"[It's] about consent, not content."

Spam Filtering
Spam filtering is classification: is an incoming email ham or spam?
Spam filters may be:
- procedural: whitelists, blacklists, challenge-response systems, ...
- collaborative: sharing signatures
- content-based: rules, decision trees, probabilities, case bases, ...
- hybrid.

Challenges of Spam Filtering
- Spam is subjective and personal;
- it is heterogeneous;
- there is a high cost to false positives (where ham is classified as spam); and
- it is constantly changing (concept drift).

Overview
- Introduction
- Case-Based Spam Filtering: Feature-Based; Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions

Case-Based Reasoning
[Figure: the CBR cycle (Aamodt & Plaza 1994). A new problem RETRIEVEs the most similar previous case; the retrieved case is REUSEd to give an adapted case; REVISE tests and repairs it; RETAIN stores the learned case alongside the previous cases and general knowledge.]

Case-Based Reasoning
[Figure: the same cycle extended with a MAINTAIN step operating on the previous cases and general knowledge.]

Is Case-Based Reasoning (CBR) the answer?
- Spam is subjective and personal → users can have individual case bases created from their own emails.
- It is heterogeneous → it is known that CBR handles disjunctive concepts well.
- There is a high cost to false positives → we can bias CBR away from false positives.
- It is constantly changing (concept drift) → case bases can be updated incrementally.

Overview
- Introduction
- Case-Based Spam Filtering: Feature-Based; Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions

Email Classification Using Examples (ECUE)
- ECUE uses Case-Based Reasoning (CBR) to classify emails.
- A case base contains a user's emails (both ham and spam).
- ECUE classifies an incoming email using the k-nearest neighbour algorithm:
  - it retrieves from the case base the k nearest neighbours (the k cases that are closest, i.e. most similar);
  - the retrieved cases then vote to decide the class of the new email;
  - to bias away from false positives, ECUE uses unanimous voting (sketched below).
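To make the retrieve-and-vote step concrete, here is a minimal sketch of k-NN classification with unanimous voting; the case base as a list of (email, label) pairs and the pluggable distance function are assumptions for illustration, not the actual ECUE code.

```python
# Minimal sketch of ECUE-style k-NN with unanimous voting (illustrative,
# not the actual ECUE implementation). case_base is a list of
# (email, label) pairs; distance is any dissimilarity function.

def classify(case_base, distance, new_email, k=3):
    # retrieve the k cases closest to the new email
    neighbours = sorted(case_base,
                        key=lambda case: distance(case[0], new_email))[:k]
    # unanimous voting biases away from false positives: a single ham
    # neighbour is enough to let the email through as ham
    if all(label == "spam" for _, label in neighbours):
        return "spam"
    return "ham"
```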

Feature-Based ECUE
[Pipeline: emails → Feature Extraction → case base]
- Each email e_i is represented as a case (f_i1, f_i2, ..., f_iN, class label).
- The features f_ij extracted are words, characters, and structural features.
- Binary representation: f_ij = 1 (feature present) or f_ij = 0 (absent).

Feature-Based ECUE
[Pipeline: emails → Feature Extraction → case base → Feature Selection → case base]
Information Gain is used to select the 700 most predictive features (a sketch follows below).
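As a minimal sketch of the selection step, assuming each case is represented as the set of binary features present in that email (an illustration, not the actual ECUE selector):

```python
# Minimal sketch of Information Gain feature selection over binary
# features (illustrative). Each case is the set of features present
# in that email; labels are the corresponding class labels.
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(cases, labels, feature):
    # IG(f) = H(class) - sum over f present/absent of P * H(class | f)
    split = [[l for c, l in zip(cases, labels) if (feature in c) == present]
             for present in (True, False)]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in split if part)
    return entropy(labels) - remainder

def select_features(cases, labels, vocabulary, n=700):
    # keep the n most predictive features
    return sorted(vocabulary,
                  key=lambda f: information_gain(cases, labels, f),
                  reverse=True)[:n]
```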

Feature-Based ECUE
[Pipeline: emails → Feature Extraction → case base → Feature Selection → case base → Case Selection → case base]
Competence-Based Editing is used to edit the case base.

Feature-Based ECUE
[Pipeline: emails → Feature Extraction → case base → Feature Selection → case base → Case Selection → case base. Runtime system: a new case is classified against the edited case base → spam!]

Feature-Based ECUE
- The distance between cases is a count of the number of features that they do not share (a sketch follows below).
- The Naïve Bayes classifier is thought to be among the best for spam filtering.
- Feature-Based ECUE has comparable, and sometimes slightly better, accuracy than Naïve Bayes.
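With the binary representation, counting the features two cases do not share is just the symmetric difference of their feature sets, i.e. a Hamming distance over the feature vector. A sketch, assuming cases are stored as feature sets:

```python
# Sketch of the feature-based distance: with binary features, the count
# of features two cases do not share is the size of the symmetric
# difference of their feature sets (a Hamming distance).

def feature_distance(features_x: set, features_y: set) -> int:
    return len(features_x ^ features_y)

# e.g. feature_distance({"viagra", "free"}, {"free", "meeting"}) == 2
```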

Overview
- Introduction
- Case-Based Spam Filtering: Feature-Based; Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions

Feature-Free ECUE
- An alternative to Feature-Based ECUE.
- Inspired by the theory of Kolmogorov Complexity, which is the basis for a distance measure:
  - K(x) = size of the smallest Turing machine that can output x to its tape;
  - K(x|y) = size of the smallest Turing machine that can output x when given y;
  - if K(x|y) < K(x|z), then y is more similar to x than z is.
[Li et al. 2003]

Feature-Free ECUE
- K(x) is uncomputable, so approximate K(x) by C(x), the size of x after compression.
- Text compression exploits intra-document redundancy.
[Figure: e.g. in "case based reasoning", a repeated substring such as "ase" can be encoded as a back-reference to its earlier occurrence.]

Using Compression
Consider the length of two documents, allowing for inter-document redundancy:
len(gzip(docX + docY)) = C(xy)

Using Compression
Consider the length of two documents, not allowing for inter-document redundancy:
len(gzip(docX)) + len(gzip(docY)) = C(x) + C(y)

Compression-Based Dissimilarity Measure (CDM)

CDM(x, y) = C(xy) / (C(x) + C(y))

- Maximum value 1 (furthest); minimum value > 0.5 (nearest).
- However, CDM is not a metric: CDM(x,x) ≠ 0; CDM(x,y) ≠ CDM(y,x); and the triangle inequality CDM(x,y) + CDM(y,z) ≥ CDM(x,z) is not guaranteed.

[Keogh et al. 2004]
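A minimal sketch of the measure, using GZip as the compressor as in the experiments below; C(x) is approximated by the gzip-compressed length:

```python
# Minimal sketch of CDM with GZip as the compressor.
import gzip

def C(text: bytes) -> int:
    return len(gzip.compress(text))

def cdm(x: bytes, y: bytes) -> float:
    # CDM(x, y) = C(xy) / (C(x) + C(y))
    return C(x + y) / (C(x) + C(y))

# Similar documents share redundancy, so compressing the concatenation
# saves space: C(xy) approaches C(x) alone and CDM approaches 0.5.
# Dissimilar documents give C(xy) close to C(x) + C(y), so CDM is near 1.
```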

Feature-Based ECUE
[Pipeline: emails → Feature Extraction → case base → Feature Selection → case base → Case Base Edit → case base. Runtime system: a new email is classified against the edited case base → spam!]

Feature-Free ECUE
[Pipeline: emails → case base → Case Base Edit → case base, with no feature extraction or selection steps. Runtime system: a new email is classified against the edited case base → spam!]

Experiments I
- Created 4 datasets of 1,000 emails from two years of email from two people; each dataset has 500 consecutive ham and 500 consecutive spam.
- 10-fold cross-validation.
- Settings: k = 3; feature-based: 700 features; feature-free: GZip as the text compressor.
- Measures: FPRate = #false positives / #ham; FNRate = #false negatives / #spam; Err = (FPRate + FNRate) / 2.

Results - % Error
[Bar chart, Datasets 1-4: feature-based error rates range from 4.0% to 13.2%; feature-free (GZip) error rates range from 0.2% to 2.4%. Feature-free has the lower error on every dataset.]

Results - % False Positives
[Bar chart, Datasets 1-4, Feature-Based vs Feature-Free (GZip); values shown: 9.2%, 1.4%, 1.0%, 1.4%, 0.6%, 0.0%, 0.8%, 1.2%, with feature-based reaching 9.2% on one dataset.]

Overview
- Case-Based Spam Filtering: Feature-Based & Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions

Case Base Maintenance
Case base editing algorithms:
- remove redundant cases, and
- remove noisy cases.
Their goal is to reduce retrieval time while maintaining, or even improving, accuracy.

Competence Model
For each case c, compute:
- the coverage set of c: cases that have c as one of their k-NN and which have the same class as c;
- the liability set of c: cases that have c as one of their k-NN and which have a different class from c.
[Figure: x is in the coverage set of c; y is in the liability set of c.]
A sketch of building the model follows below.
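A minimal sketch of computing both sets, assuming the same (email, label) case base and pluggable distance function as before; illustrative only:

```python
# Minimal sketch of building the competence model (illustrative).
# case_base is a list of (email, label) pairs; distance is any
# dissimilarity function, e.g. CDM.

def competence_model(case_base, distance, k=3):
    n = len(case_base)
    coverage = {i: set() for i in range(n)}
    liability = {i: set() for i in range(n)}
    for j in range(n):
        email_j, label_j = case_base[j]
        # the k nearest neighbours of case j, excluding j itself
        nn = sorted((i for i in range(n) if i != j),
                    key=lambda i: distance(case_base[i][0], email_j))[:k]
        for i in nn:
            if case_base[i][1] == label_j:
                coverage[i].add(j)   # j has i as a k-NN, same class
            else:
                liability[i].add(j)  # j has i as a k-NN, different class
    return coverage, liability
```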

Competence-Based Editing
Blame-Based Noise Reduction (BBNR):
- for each case c with a non-empty liability set, taken in descending order of liability-set size: if the cases in c's coverage set can still be correctly classified without c, then c can be deleted;
- this emphasises removal of cases that cause misclassifications (sketched below).
Conservative Redundancy Reduction (CRR):
- for each remaining case c, taken in ascending order of coverage-set size: retain c but delete the cases in c's coverage set;
- this retains cases close to class boundaries.
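A sketch of the BBNR step, driven by the competence model above; the still_correct_without helper is hypothetical, standing in for re-running k-NN over the remaining cases:

```python
# Sketch of Blame-Based Noise Reduction (illustrative). coverage and
# liability come from competence_model above; still_correct_without(j,
# removed) is a hypothetical helper that re-classifies case j by k-NN
# after the cases in `removed` have been taken out.

def bbnr(case_base, coverage, liability, still_correct_without):
    removed = set()
    # examine blamed cases in descending order of liability-set size
    for i in sorted((i for i in liability if liability[i]),
                    key=lambda i: len(liability[i]), reverse=True):
        # delete case i only if every case it covers is still
        # classified correctly without it
        if all(still_correct_without(j, removed | {i})
               for j in coverage[i]):
            removed.add(i)
    return [c for i, c in enumerate(case_base) if i not in removed]
```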

Results - % Error
[Bar chart, Datasets 1 and 3, Feature-Based (full vs edited) and Feature-Free (full vs edited); values shown: 9.8%, 7.0%, 5.7%, 3.8%, 2.4%, 2.2%, 2.6%, 2.2%.]
Feature-based edited case base size = 75% and 65% of the original; feature-free edited size = 59% and 57%.

Results - % False Positives
[Bar chart, Datasets 1 and 3, Feature-Based (full vs edited) and Feature-Free (full vs edited); values shown: 9.2%, 3.4%, 2.2%, 1.0%, 1.4%, 0.8%, 1.0%, 0.4%.]

Overview
- Case-Based Spam Filtering: Feature-Based & Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions

Concept Drift
The target concept is not static:
- it changes according to season;
- it changes according to world events;
- people's interests and tolerances change;
- there is an arms race: ever more devious spamouflage!
We need to investigate behaviour over time.

Experiments III
- Took ~10,000 emails from two years of email from two people, in date order.
- Created a case base for each person from the earliest 500 consecutive ham and the earliest 500 consecutive spam.
- Presented the remaining ~9,000 emails chronologically as test cases.
- Same settings and measures as before: k = 3; feature-based: 700 features; feature-free: GZip as the text compressor.

Retention policies
CBR (and other lazy learners) can easily incorporate the most recent examples:
- retain-all: store every new email in the case base;
- retain-misclassifieds: store a new email only if our prediction for it was wrong.

Results - % Error
[Bar chart: Feature-Free (GZip) 15.9% (Dataset A) and 12.6% (Dataset B); with retain-misclassifieds, 2.3% and 3.2%.]
When we retain misclassified cases, case bases increase in size by ~30%.

Results - % False Positives
[Bar chart: Feature-Free (GZip) 4.0% (Dataset A) and 3.5% (Dataset B); with retain-misclassifieds, 1.5% and 0.7%.]

Retention
- A bigger case base reduces efficiency.
- Obsolete cases may reduce accuracy.
- Obsolete features may reduce accuracy.
- We need a deletion policy.

Incremental Solutions
Consider add-1-delete-1, so the case base size remains constant:
- retention policy: retain-all or retain-misclassified;
- forgetting policy: forget-oldest or forget-least-accurate.
(Related approaches: instance selection, instance weighting.)

Incremental Solutions
Consider add-1-delete-1, so the case base size remains constant:
- retention policy: retain-all or retain-misclassified;
- forgetting policy: forget-oldest or forget-least-accurate, where a case's accuracy = #successes / #retrievals.
A sketch of these policies follows below.
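A minimal sketch of one combination, retain-misclassifieds with forget-least-accurate; the per-case bookkeeping details are assumptions for illustration:

```python
# Minimal sketch of add-1-delete-1 maintenance with the
# retain-misclassifieds and forget-least-accurate policies
# (illustrative). stats[i] tracks per-case successes and retrievals,
# so that accuracy(i) = successes / retrievals.

def maintain(case_base, stats, new_email, true_label, predicted, retrieved):
    # update accuracy statistics for the cases that were retrieved
    for i in retrieved:
        stats[i]["retrievals"] += 1
        if case_base[i][1] == true_label:
            stats[i]["successes"] += 1
    # retain-misclassifieds: only store the email if we got it wrong
    if predicted != true_label:
        # forget-least-accurate: replace the case with the lowest
        # success rate, keeping the case base size constant
        worst = min(range(len(case_base)),
                    key=lambda i: stats[i]["successes"]
                                  / max(stats[i]["retrievals"], 1))
        case_base[worst] = (new_email, true_label)
        stats[worst] = {"successes": 0, "retrievals": 0}
```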

Results - % Error
[Bar chart, values as Dataset A / Dataset B:]
- Feature-Free, no maintenance: 15.9% / 12.6%
- Feature-Free: retain-misclassifieds, forget-oldest: 3.2% / 2.3%
- Feature-Free: retain-all, forget-oldest: 2.8% / 1.7%
- Feature-Free: retain-misclassifieds, forget-least-accurate: 4.0% / 1.8%
- Feature-Free: retain-all, forget-least-accurate: 3.0% / 1.9%

Results - % False Positives
[Bar chart, Datasets A and B, for the same five policies; values shown: 4.0%, 3.5% (no maintenance) and 4.2%, 1.7%, 1.8%, 2.4%, 6.4%, 5.0%, 1.3%, 0.7% across the retention/forgetting combinations.]
Is there a negative effect on false positives?

Periodic Solutions
- Feature-based: retain-misclassified; monthly feature re-extraction, feature re-selection, case base rebuild, and case base edit.
- Feature-free: retain-misclassified; monthly case base edit.

Feature-Based ECUE
[Pipeline: emails → Feature Extraction → case base → Feature Selection → case base → Case Base Edit → case base]

Results - % Error
[Bar chart, values as Dataset A / Dataset B:]
- Feature-Based: 19.2% / 15.4%
- Feature-Based: retain-misclassifieds, monthly reselect & edit: 4.5% / 6.1%
- Feature-Free: 15.9% / 12.6%
- Feature-Free: retain-misclassifieds, monthly edit: 2.3% / 2.6%

Results - % False Positives
[Bar chart, Datasets A and B; values shown: 20.0% and 14.7% (feature-based, no maintenance), 2.0% and 2.4% (feature-based, retain-misclassifieds & monthly reselect & edit), and 4.0%, 0.7%, 0.9%, 2.5% for feature-free without and with retain-misclassifieds & monthly edit.]

Overview
- Case-Based Spam Filtering: Feature-Based & Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions

Feature-Free ECUE: Advantages
- Accuracy: a lower error rate than traditional feature-based methods, and often a lower false positive rate.
- Costs: it uses the raw text, so there is no need to extract, select or weight features.
- Concept drift: no need to update features as spam changes; simple retention/forgetting policies can be effective.

Feature-Free ECUE: Disadvantages
- No justification: there are no features to explain results or to drive adaptation.
- Higher computation time: to classify an email against a case base of 1,000 cases, feature-free takes 2 secs versus 0.01 sec for feature-based.
- The measure is not a metric.

Future Work
- Investigating algorithms to speed up retrieval time.
- Applying the measure to text other than emails.

Thank you for your attention!

Spare slides

Normalized Compression Distance (NCD)

NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))

- Maximum value = 1 + ε (furthest); minimum value = 0 (nearest).
- However, in practice NCD is not a metric: NCD(x,x) ≠ 0; NCD(x,y) ≠ NCD(y,x); and the triangle inequality NCD(x,y) + NCD(y,z) ≥ NCD(x,z) is not guaranteed.

[Li et al. 2003]
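A minimal sketch of the formula, again with GZip standing in for the ideal compressor:

```python
# Minimal sketch of NCD with GZip approximating the ideal compressor.
import gzip

def C(text: bytes) -> int:
    return len(gzip.compress(text))

def ncd(x: bytes, y: bytes) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)
```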

Comparing Compression Algorithms
- Does better compression give a better measure?
- We compared GZip with Prediction by Partial Matching (PPM):
  - GZip is a Lempel-Ziv variant;
  - PPM is an adaptive statistical compressor.
A sketch of swapping compressors follows below.
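The measure is parameterised by the compressor. Python's standard library has no PPM implementation, so in the sketch below bz2 (a Burrows-Wheeler compressor) stands in as the second algorithm purely for illustration:

```python
# Sketch of parameterising CDM by the compression algorithm. The
# standard library has no PPM compressor, so bz2 stands in here.
import bz2
import gzip

def cdm_with(compress, x: bytes, y: bytes) -> float:
    C = lambda d: len(compress(d))
    return C(x + y) / (C(x) + C(y))

d_gzip = cdm_with(gzip.compress, b"case based reasoning", b"case base editing")
d_bz2 = cdm_with(bz2.compress, b"case based reasoning", b"case base editing")
```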

Results - % Error
[Bar chart, Datasets 1-4, GZip vs PPM(2), PPM(4) and PPM(8); all error rates lie between 0.1% and 2.5%, with little difference between the compressors.]

Results
- Little difference in classification error: the choice of compressor does not greatly matter.
- PPM is generally considered better at compression, but on our datasets: an average of 59% compression for GZip versus 57% for PPM.
- PPM is computationally expensive: 180 times slower than GZip.

GZip Speed Up
- GZip uses a 32 KByte sliding window. [Figure: docX and docY side by side under the 32KB window.]
- Truncate each email to 16 KB, so that any two concatenated emails still fit within the window.
- This achieves speed-ups of between 9.5% and 25%.
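A sketch of the idea; MAX_BYTES and the helper name are illustrative:

```python
# Sketch of the speed-up: truncating each email to 16 KB means the
# concatenation of any two emails fits inside GZip's 32 KB sliding
# window, so inter-document redundancy can still be found while
# compression runs over less data.

MAX_BYTES = 16 * 1024  # 16 KB

def truncate(email: bytes) -> bytes:
    return email[:MAX_BYTES]

# the distance is then computed as cdm(truncate(x), truncate(y))
```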
