
A Comparison of Feature-Based and Feature-Free Case-Based Reasoning for Spam Filtering

Derek Bridge, University College Cork

Sarah Jane Delany, Dublin Institute of Technology

Overview
- Introduction
- Case-Based Spam Filtering: Feature-Based; Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions

Introduction

From the Spamhaus project (www.spamhaus.org)


An electronic message is spam IF:
1) the recipient's personal identity and context are irrelevant because the message is equally applicable to many other potential recipients; AND
2) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for it to be sent.

We focus on email spam

"[It's] about consent, not content."

Spam Filtering
Spam filtering is classification: is an incoming email ham or spam?
Spam filters may be:
- procedural: whitelists, blacklists, challenge-response systems, ...
- collaborative: sharing signatures
- content-based: rules, decision trees, probabilities, case bases, ...
- hybrid.

Challenges of Spam Filtering
- Spam is subjective and personal;
- it is heterogeneous;
- there is a high cost to false positives (where ham is classified as spam); and
- it is constantly changing (concept drift).

Overview
- Introduction
- Case-Based Spam Filtering: Feature-Based; Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions

Case-Based Reasoning
[Figure: the CBR cycle (Aamodt & Plaza 1994). A new problem RETRIEVEs the most similar previous case; the retrieved case is REUSEd to give an adapted case; REVISE tests and repairs it; RETAIN stores the learned case alongside the previous cases and general knowledge.]

Case-Based Reasoning
[Figure: the same cycle extended with a MAINTAIN step operating on the previous cases and general knowledge.]

Is Case-Based Reasoning (CBR) the answer?
- Spam is subjective and personal → users can have individual case bases created from their own emails.
- It is heterogeneous → it is known that CBR handles disjunctive concepts well.
- There is a high cost to false positives → we can bias CBR away from false positives.
- It is constantly changing (concept drift) → case bases can be updated incrementally.

Overview
- Introduction
- Case-Based Spam Filtering: Feature-Based; Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions

Email Classification Using Examples (ECUE)
- ECUE uses Case-Based Reasoning (CBR) to classify emails.
- A case base contains a user's emails (both ham and spam).
- ECUE classifies an incoming email using the k-nearest neighbour algorithm:
  - it retrieves from the case base the k nearest neighbours (the k cases that are closest, i.e. most similar);
  - the retrieved cases then vote to decide the class of the new email;
  - to bias away from false positives, ECUE uses unanimous voting (sketched below).
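To make the retrieve-and-vote step concrete, here is a minimal sketch of k-NN classification with unanimous voting; the case base as a list of (email, label) pairs and the pluggable distance function are assumptions for illustration, not the actual ECUE code.

```python
# Minimal sketch of ECUE-style k-NN with unanimous voting (illustrative,
# not the actual ECUE implementation). case_base is a list of
# (email, label) pairs; distance is any dissimilarity function.

def classify(case_base, distance, new_email, k=3):
    # retrieve the k cases closest to the new email
    neighbours = sorted(case_base,
                        key=lambda case: distance(case[0], new_email))[:k]
    # unanimous voting biases away from false positives: a single ham
    # neighbour is enough to let the email through as ham
    if all(label == "spam" for _, label in neighbours):
        return "spam"
    return "ham"
```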

Feature-Based ECUE
[Pipeline: emails → Feature Extraction → case base]
- Each email e_i is represented as a case (f_i1, f_i2, ..., f_iN, class label).
- The features f_ij extracted are words, characters, and structural features.
- Binary representation: f_ij = 1 (feature present) or f_ij = 0 (absent).

Feature-Based ECUE
[Pipeline: emails → Feature Extraction → case base → Feature Selection → case base]
Information Gain is used to select the 700 most predictive features (a sketch follows below).
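As a minimal sketch of the selection step, assuming each case is represented as the set of binary features present in that email (an illustration, not the actual ECUE selector):

```python
# Minimal sketch of Information Gain feature selection over binary
# features (illustrative). Each case is the set of features present
# in that email; labels are the corresponding class labels.
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(cases, labels, feature):
    # IG(f) = H(class) - sum over f present/absent of P * H(class | f)
    split = [[l for c, l in zip(cases, labels) if (feature in c) == present]
             for present in (True, False)]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in split if part)
    return entropy(labels) - remainder

def select_features(cases, labels, vocabulary, n=700):
    # keep the n most predictive features
    return sorted(vocabulary,
                  key=lambda f: information_gain(cases, labels, f),
                  reverse=True)[:n]
```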

Feature-Based ECUE
[Pipeline: emails → Feature Extraction → case base → Feature Selection → case base → Case Selection → case base]
Competence-Based Editing is used to edit the case base.

Feature-Based ECUE
[Pipeline: emails → Feature Extraction → case base → Feature Selection → case base → Case Selection → case base. Runtime system: a new case is classified against the edited case base → spam!]

Feature-Based ECUE
- The distance between cases is a count of the number of features that they do not share (a sketch follows below).
- The Naïve Bayes classifier is thought to be among the best for spam filtering.
- Feature-Based ECUE has comparable, and sometimes slightly better, accuracy than Naïve Bayes.
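With the binary representation, counting the features two cases do not share is just the symmetric difference of their feature sets, i.e. a Hamming distance over the feature vector. A sketch, assuming cases are stored as feature sets:

```python
# Sketch of the feature-based distance: with binary features, the count
# of features two cases do not share is the size of the symmetric
# difference of their feature sets (a Hamming distance).

def feature_distance(features_x: set, features_y: set) -> int:
    return len(features_x ^ features_y)

# e.g. feature_distance({"viagra", "free"}, {"free", "meeting"}) == 2
```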

Overview
- Introduction
- Case-Based Spam Filtering: Feature-Based; Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions

Feature-Free ECUE
- An alternative to Feature-Based ECUE.
- Inspired by the theory of Kolmogorov Complexity, which is the basis for a distance measure:
  - K(x) = size of the smallest Turing machine that can output x to its tape;
  - K(x|y) = size of the smallest Turing machine that can output x when given y;
  - if K(x|y) < K(x|z), then y is more similar to x than z is.
[Li et al. 2003]

Feature-Free ECUE
- K(x) is uncomputable, so approximate K(x) by C(x), the size of x after compression.
- Text compression exploits intra-document redundancy.
[Figure: e.g. in "case based reasoning", a repeated substring such as "ase" can be encoded as a back-reference to its earlier occurrence.]

Using Compression
Consider the length of two documents, allowing for inter-document redundancy:
len(gzip(docX + docY)) = C(xy)

Using Compression
Consider the length of two documents, not allowing for inter-document redundancy:
len(gzip(docX)) + len(gzip(docY)) = C(x) + C(y)

Compression-Based Dissimilarity Measure (CDM)

CDM(x, y) = C(xy) / (C(x) + C(y))

- Maximum value 1 (furthest); minimum value > 0.5 (nearest).
- However, CDM is not a metric: CDM(x,x) ≠ 0; CDM(x,y) ≠ CDM(y,x); and the triangle inequality CDM(x,y) + CDM(y,z) ≥ CDM(x,z) is not guaranteed.

[Keogh et al. 2004]
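A minimal sketch of the measure, using GZip as the compressor as in the experiments below; C(x) is approximated by the gzip-compressed length:

```python
# Minimal sketch of CDM with GZip as the compressor.
import gzip

def C(text: bytes) -> int:
    return len(gzip.compress(text))

def cdm(x: bytes, y: bytes) -> float:
    # CDM(x, y) = C(xy) / (C(x) + C(y))
    return C(x + y) / (C(x) + C(y))

# Similar documents share redundancy, so compressing the concatenation
# saves space: C(xy) approaches C(x) alone and CDM approaches 0.5.
# Dissimilar documents give C(xy) close to C(x) + C(y), so CDM is near 1.
```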

Feature-Based ECUE
[Pipeline: emails → Feature Extraction → case base → Feature Selection → case base → Case Base Edit → case base. Runtime system: a new email is classified against the edited case base → spam!]

Feature-Free ECUE
[Pipeline: emails → case base → Case Base Edit → case base, with no feature extraction or selection steps. Runtime system: a new email is classified against the edited case base → spam!]

Experiments I
- Created 4 datasets of 1,000 emails from two years of email from two people; each dataset has 500 consecutive ham and 500 consecutive spam.
- 10-fold cross-validation.
- Settings: k = 3; feature-based: 700 features; feature-free: GZip as the text compressor.
- Measures: FPRate = #false positives / #ham; FNRate = #false negatives / #spam; Err = (FPRate + FNRate) / 2.

Results - % Error
[Bar chart, Datasets 1-4: feature-based error rates range from 4.0% to 13.2%; feature-free (GZip) error rates range from 0.2% to 2.4%. Feature-free has the lower error on every dataset.]

Results - % False Positives
[Bar chart, Datasets 1-4, Feature-Based vs Feature-Free (GZip); values shown: 9.2%, 1.4%, 1.0%, 1.4%, 0.6%, 0.0%, 0.8%, 1.2%, with feature-based reaching 9.2% on one dataset.]

Overview
- Case-Based Spam Filtering: Feature-Based & Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions

Case Base Maintenance
Case base editing algorithms:
- remove redundant cases, and
- remove noisy cases.
Their goal is to reduce retrieval time while maintaining, or even improving, accuracy.

Competence Model
For each case c, compute:
- the coverage set of c: cases that have c as one of their k-NN and which have the same class as c;
- the liability set of c: cases that have c as one of their k-NN and which have a different class from c.
[Figure: x is in the coverage set of c; y is in the liability set of c.]
A sketch of building the model follows below.
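A minimal sketch of computing both sets, assuming the same (email, label) case base and pluggable distance function as before; illustrative only:

```python
# Minimal sketch of building the competence model (illustrative).
# case_base is a list of (email, label) pairs; distance is any
# dissimilarity function, e.g. CDM.

def competence_model(case_base, distance, k=3):
    n = len(case_base)
    coverage = {i: set() for i in range(n)}
    liability = {i: set() for i in range(n)}
    for j in range(n):
        email_j, label_j = case_base[j]
        # the k nearest neighbours of case j, excluding j itself
        nn = sorted((i for i in range(n) if i != j),
                    key=lambda i: distance(case_base[i][0], email_j))[:k]
        for i in nn:
            if case_base[i][1] == label_j:
                coverage[i].add(j)   # j has i as a k-NN, same class
            else:
                liability[i].add(j)  # j has i as a k-NN, different class
    return coverage, liability
```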

Competence-Based Editing
Blame-Based Noise Reduction (BBNR):
- for each case c with a non-empty liability set, taken in descending order of liability-set size: if the cases in c's coverage set can still be correctly classified without c, then c can be deleted;
- this emphasises removal of cases that cause misclassifications (sketched below).
Conservative Redundancy Reduction (CRR):
- for each remaining case c, taken in ascending order of coverage-set size: retain c but delete the cases in c's coverage set;
- this retains cases close to class boundaries.
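A sketch of the BBNR step, driven by the competence model above; the still_correct_without helper is hypothetical, standing in for re-running k-NN over the remaining cases:

```python
# Sketch of Blame-Based Noise Reduction (illustrative). coverage and
# liability come from competence_model above; still_correct_without(j,
# removed) is a hypothetical helper that re-classifies case j by k-NN
# after the cases in `removed` have been taken out.

def bbnr(case_base, coverage, liability, still_correct_without):
    removed = set()
    # examine blamed cases in descending order of liability-set size
    for i in sorted((i for i in liability if liability[i]),
                    key=lambda i: len(liability[i]), reverse=True):
        # delete case i only if every case it covers is still
        # classified correctly without it
        if all(still_correct_without(j, removed | {i})
               for j in coverage[i]):
            removed.add(i)
    return [c for i, c in enumerate(case_base) if i not in removed]
```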

Results - % Error
[Bar chart, Datasets 1 and 3, Feature-Based (full vs edited) and Feature-Free (full vs edited); values shown: 9.8%, 7.0%, 5.7%, 3.8%, 2.4%, 2.2%, 2.6%, 2.2%.]
Feature-based edited case base size = 75% and 65% of the original; feature-free edited size = 59% and 57%.

Results - % False Positives
[Bar chart, Datasets 1 and 3, Feature-Based (full vs edited) and Feature-Free (full vs edited); values shown: 9.2%, 3.4%, 2.2%, 1.0%, 1.4%, 0.8%, 1.0%, 0.4%.]

Overview
- Case-Based Spam Filtering: Feature-Based & Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions

Concept Drift
The target concept is not static:
- it changes according to season;
- it changes according to world events;
- people's interests and tolerances change;
- there is an arms race: ever more devious spamouflage!
We need to investigate behaviour over time.

Experiments III
- Took ~10,000 emails from two years of email from two people, in date order.
- Created a case base for each person from the earliest 500 consecutive ham and the earliest 500 consecutive spam.
- Presented the remaining ~9,000 emails chronologically as test cases.
- Same settings and measures as before: k = 3; feature-based: 700 features; feature-free: GZip as the text compressor.

Retention policies
CBR (and other lazy learners) can easily incorporate the most recent examples:
- retain-all: store every new email in the case base;
- retain-misclassifieds: store a new email only if our prediction for it was wrong.

Results - % Error
[Bar chart: Feature-Free (GZip) 15.9% (Dataset A) and 12.6% (Dataset B); with retain-misclassifieds, 2.3% and 3.2%.]
When we retain misclassified cases, case bases increase in size by ~30%.

Results - % False Positives
[Bar chart: Feature-Free (GZip) 4.0% (Dataset A) and 3.5% (Dataset B); with retain-misclassifieds, 1.5% and 0.7%.]

Retention
- A bigger case base reduces efficiency.
- Obsolete cases may reduce accuracy.
- Obsolete features may reduce accuracy.
- We need a deletion policy.

Incremental Solutions
Consider add-1-delete-1, so the case base size remains constant:
- retention policy: retain-all or retain-misclassified;
- forgetting policy: forget-oldest or forget-least-accurate.
(Related approaches: instance selection, instance weighting.)

Incremental Solutions
Consider add-1-delete-1, so the case base size remains constant:
- retention policy: retain-all or retain-misclassified;
- forgetting policy: forget-oldest or forget-least-accurate, where a case's accuracy = #successes / #retrievals.
A sketch of these policies follows below.
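A minimal sketch of one combination, retain-misclassifieds with forget-least-accurate; the per-case bookkeeping details are assumptions for illustration:

```python
# Minimal sketch of add-1-delete-1 maintenance with the
# retain-misclassifieds and forget-least-accurate policies
# (illustrative). stats[i] tracks per-case successes and retrievals,
# so that accuracy(i) = successes / retrievals.

def maintain(case_base, stats, new_email, true_label, predicted, retrieved):
    # update accuracy statistics for the cases that were retrieved
    for i in retrieved:
        stats[i]["retrievals"] += 1
        if case_base[i][1] == true_label:
            stats[i]["successes"] += 1
    # retain-misclassifieds: only store the email if we got it wrong
    if predicted != true_label:
        # forget-least-accurate: replace the case with the lowest
        # success rate, keeping the case base size constant
        worst = min(range(len(case_base)),
                    key=lambda i: stats[i]["successes"]
                                  / max(stats[i]["retrievals"], 1))
        case_base[worst] = (new_email, true_label)
        stats[worst] = {"successes": 0, "retrievals": 0}
```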

Results - % Error
[Bar chart, values as Dataset A / Dataset B:]
- Feature-Free, no maintenance: 15.9% / 12.6%
- Feature-Free: retain-misclassifieds, forget-oldest: 3.2% / 2.3%
- Feature-Free: retain-all, forget-oldest: 2.8% / 1.7%
- Feature-Free: retain-misclassifieds, forget-least-accurate: 4.0% / 1.8%
- Feature-Free: retain-all, forget-least-accurate: 3.0% / 1.9%

Results - % False Positives
[Bar chart, Datasets A and B, for the same five policies; values shown: 4.0%, 3.5% (no maintenance) and 4.2%, 1.7%, 1.8%, 2.4%, 6.4%, 5.0%, 1.3%, 0.7% across the retention/forgetting combinations.]
Is there a negative effect on false positives?

Periodic Solutions
- Feature-based: retain-misclassified; monthly feature re-extraction, feature re-selection, case base rebuild, and case base edit.
- Feature-free: retain-misclassified; monthly case base edit.

Feature-Based ECUE
[Pipeline: emails → Feature Extraction → case base → Feature Selection → case base → Case Base Edit → case base]

Results - % Error
[Bar chart, values as Dataset A / Dataset B:]
- Feature-Based: 19.2% / 15.4%
- Feature-Based: retain-misclassifieds, monthly reselect & edit: 4.5% / 6.1%
- Feature-Free: 15.9% / 12.6%
- Feature-Free: retain-misclassifieds, monthly edit: 2.3% / 2.6%

Results - % False Positives
[Bar chart, Datasets A and B; values shown: 20.0% and 14.7% (feature-based, no maintenance), 2.0% and 2.4% (feature-based, retain-misclassifieds & monthly reselect & edit), and 4.0%, 0.7%, 0.9%, 2.5% for feature-free without and with retain-misclassifieds & monthly edit.]

Overview
- Case-Based Spam Filtering: Feature-Based & Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions

Feature-Free ECUE: Advantages
- Accuracy: a lower error rate than traditional feature-based methods, and often a lower false positive rate.
- Costs: it uses the raw text, so there is no need to extract, select or weight features.
- Concept drift: no need to update features as spam changes; simple retention/forgetting policies can be effective.

Feature-Free ECUE: Disadvantages
- No justification: there are no features to explain results or to drive adaptation.
- Higher computation time: to classify an email against a case base of 1,000 cases, feature-free takes 2 secs versus 0.01 sec for feature-based.
- The measure is not a metric.

Future Work
- Investigating algorithms to speed up retrieval time.
- Applying the measure to text other than emails.

Thank you for your attention!

Spare slides

Normalized Compression Distance (NCD)

NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))

- Maximum value = 1 + ε (furthest); minimum value = 0 (nearest).
- However, in practice NCD is not a metric: NCD(x,x) ≠ 0; NCD(x,y) ≠ NCD(y,x); and the triangle inequality NCD(x,y) + NCD(y,z) ≥ NCD(x,z) is not guaranteed.

[Li et al. 2003]
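A minimal sketch of the formula, again with GZip standing in for the ideal compressor:

```python
# Minimal sketch of NCD with GZip approximating the ideal compressor.
import gzip

def C(text: bytes) -> int:
    return len(gzip.compress(text))

def ncd(x: bytes, y: bytes) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)
```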

Comparing Compression Algorithms
- Does better compression give a better measure?
- We compared GZip with Prediction by Partial Matching (PPM):
  - GZip is a Lempel-Ziv variant;
  - PPM is an adaptive statistical compressor.
A sketch of swapping compressors follows below.
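The measure is parameterised by the compressor. Python's standard library has no PPM implementation, so in the sketch below bz2 (a Burrows-Wheeler compressor) stands in as the second algorithm purely for illustration:

```python
# Sketch of parameterising CDM by the compression algorithm. The
# standard library has no PPM compressor, so bz2 stands in here.
import bz2
import gzip

def cdm_with(compress, x: bytes, y: bytes) -> float:
    C = lambda d: len(compress(d))
    return C(x + y) / (C(x) + C(y))

d_gzip = cdm_with(gzip.compress, b"case based reasoning", b"case base editing")
d_bz2 = cdm_with(bz2.compress, b"case based reasoning", b"case base editing")
```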

Results - % Error
[Bar chart, Datasets 1-4, GZip vs PPM(2), PPM(4) and PPM(8); all error rates lie between 0.1% and 2.5%, with little difference between the compressors.]

Results
- Little difference in classification error: the choice of compressor does not greatly matter.
- PPM is generally considered better at compression, but on our datasets: an average of 59% compression for GZip versus 57% for PPM.
- PPM is computationally expensive: 180 times slower than GZip.

GZip Speed Up
- GZip uses a 32 KByte sliding window. [Figure: docX and docY side by side under the 32KB window.]
- Truncate each email to 16 KB, so that any two concatenated emails still fit within the window.
- This achieves speed-ups of between 9.5% and 25%.
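A sketch of the idea; MAX_BYTES and the helper name are illustrative:

```python
# Sketch of the speed-up: truncating each email to 16 KB means the
# concatenation of any two emails fits inside GZip's 32 KB sliding
# window, so inter-document redundancy can still be found while
# compression runs over less data.

MAX_BYTES = 16 * 1024  # 16 KB

def truncate(email: bytes) -> bytes:
    return email[:MAX_BYTES]

# the distance is then computed as cdm(truncate(x), truncate(y))
```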
