Derek Bridge
Overview
- Introduction
- Case-Based Spam Filtering
- Feature-Based & Feature-Free
- Experiments I
Introduction
Spam Filtering
Spam filtering is classification: is an incoming email ham or spam?

Spam filters may be:
- procedural: whitelists, blacklists, challenge-response systems, ...
- collaborative: sharing signatures
- content-based: rules, decision trees, probabilities, case bases, ...
- hybrid
Case-Based Reasoning
[Diagram: the CBR cycle. A new problem triggers RETRIEVE, yielding a retrieved case; REUSE produces an adapted case; MAINTAIN keeps the case base up to date.]
Spam is subjective and personal; it is heterogeneous; there is a high cost to false positives (where ham is classified as spam); and it is constantly changing (concept drift).

CBR suits this domain:
- Users can have individual case bases created from their own emails
- Case bases can be updated incrementally
- It is known that CBR handles disjunctive concepts well
- We can bias CBR away from false positives
Feature-Based ECUE

[Architecture diagram, built up over several slides: training emails pass through Feature Extraction, then Feature Selection, then Case Selection to produce the Casebase. In the Runtime System, a new email becomes a new case, which Classification compares against the Casebase to label it, e.g. "spam!"]
Feature-Based ECUE
- The distance between cases is a count of the number of features that they do not share
- Naïve Bayes classifiers are thought to be among the best for spam filtering
- Feature-Based ECUE has comparable, and sometimes slightly better, accuracy than Naïve Bayes
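As a sketch, the distance above can be written as a symmetric-difference count over each case's set of binary features (the feature names below are illustrative, not taken from ECUE):

```python
def feature_distance(case_a: set, case_b: set) -> int:
    """Distance = number of features the two cases do not share."""
    return len(case_a ^ case_b)  # symmetric difference of feature sets

# Illustrative feature sets, not real ECUE features
spam_features = {"viagra", "free", "click here"}
ham_features = {"meeting", "free", "agenda"}
print(feature_distance(spam_features, ham_features))  # 4
```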
Feature-Free ECUE
An alternative to Feature-Based ECUE, inspired by the theory of Kolmogorov Complexity:
- K(x) = size of the smallest Turing machine that can output x to its tape
- K(x|y) = size of the smallest Turing machine that can output x when given y
Feature-Free ECUE
K(x) is uncomputable, so approximate K(x) by C(x), where C(x) = the size of x after compression.
Using Compression
Consider the combined length of two documents, allowing for inter-document redundancy:

len(gzip(docX + docY)) = C(xy)
Using Compression
Consider the combined length of the two documents without allowing for inter-document redundancy:

len(gzip(docX)) + len(gzip(docY)) = C(x) + C(y)

If x and y share content, C(xy) will be noticeably smaller than C(x) + C(y).
[Keogh et al 2004]
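This comparison is the basis of compression-based dissimilarity measures such as the CDM of [Keogh et al 2004], which takes the ratio of the two quantities. A minimal sketch using gzip (the example "emails" are invented):

```python
import gzip

def C(data: bytes) -> int:
    """Approximate Kolmogorov complexity K by compressed size."""
    return len(gzip.compress(data))

def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity: C(xy) / (C(x) + C(y)).
    Lower values mean more inter-document redundancy."""
    return C(x + y) / (C(x) + C(y))

spam1 = b"cheap meds buy now " * 50
spam2 = b"cheap meds buy now today " * 50
ham = b"minutes of the staff meeting " * 50
print(cdm(spam1, spam2), cdm(spam1, ham))  # similar texts score lower
```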
Feature-Free ECUE

[Architecture diagram: in contrast to Feature-Based ECUE, there is no feature extraction and no feature selection. The raw text of a new email is compared directly, via compression, against the casebase of raw emails, and Classification labels it, e.g. "spam!"]
Experiments I
Created 4 datasets of 1000 emails from two years of email from two people; each dataset has 500 consecutive ham and 500 consecutive spam. 10-fold cross-validation.

Settings:
- k = 3
- Feature-based: 700 features
- Feature-free: GZip as text compressor

Measures:
- FPRate = #false positives / #ham
- FNRate = #false negatives / #spam
- Err = (FPRate + FNRate) / 2
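The measures above can be sketched directly; averaging the two rates weights ham and spam errors equally regardless of class sizes (function and variable names here are illustrative):

```python
def error_rates(false_pos, false_neg, n_ham, n_spam):
    """FPRate, FNRate and their average Err, as defined above."""
    fp_rate = false_pos / n_ham    # ham misclassified as spam
    fn_rate = false_neg / n_spam   # spam misclassified as ham
    err = (fp_rate + fn_rate) / 2  # classes weighted equally
    return fp_rate, fn_rate, err

fp, fn, err = error_rates(5, 9, 500, 500)
print(fp, fn, err)
```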
Results - % Error
[Bar chart: % error on Datasets 1 to 4. Feature-Based errors include 13.2% and 9.8%; Feature-Free (GZip) errors include 1.4%, 1.0% and 1.2%.]
Overview
- Case-Based Spam Filtering: Feature-Based & Feature-Free; Experiments I
- Case Base Maintenance: Competence-Based Editing; Experiments II
- Concept Drift: Incremental & periodic solutions; Experiments III
- Conclusions
Case base editing algorithms aim to reduce retrieval time while maintaining, or even improving, accuracy.
Competence Model
For each case c, compute:
- the coverage set of c: cases that have c as one of their k-NN and which have the same class as c
- the liability set of c: cases that have c as one of their k-NN and which have a different class from c

[Diagram: x is in the coverage set of c; y is in the liability set of c]
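The coverage and liability sets above can be computed in one pass over the case base. A sketch, assuming cases are (content, label) pairs and `dist` is any distance function on contents (both are assumptions of this sketch):

```python
from collections import defaultdict

def competence_model(cases, dist, k=3):
    """Build coverage and liability sets as defined above."""
    coverage = defaultdict(set)   # c -> cases that c helps classify
    liability = defaultdict(set)  # c -> cases that c helps misclassify
    for i, (x, label) in enumerate(cases):
        # the k nearest neighbours of case i, excluding i itself
        knn = sorted((j for j in range(len(cases)) if j != i),
                     key=lambda j: dist(x, cases[j][0]))[:k]
        for j in knn:
            if cases[j][1] == label:
                coverage[j].add(i)   # i has j as a k-NN, same class
            else:
                liability[j].add(i)  # i has j as a k-NN, different class
    return coverage, liability
```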
Competence-Based Editing
Blame-Based Noise Reduction: for each case c with a non-empty liability set (taken in descending order of liability set size), if the cases in c's coverage set can still be correctly classified without c, then c can be deleted. This emphasises removal of cases that cause misclassifications.

Then, for each remaining case c (taken in ascending order of coverage set size), retain c but delete the cases in c's coverage set. This retains cases close to class boundaries.
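The noise-reduction stage can be sketched as follows, assuming the coverage/liability sets from the competence model and a `still_correct(covered_case, reduced_base)` helper (an assumption of this sketch) that checks whether a covered case is still correctly classified by a reduced case base:

```python
def bbnr(case_base, coverage, liability, still_correct):
    """Blame-Based Noise Reduction: consider blamed cases, biggest
    liability set first, and delete each one whose covered cases
    are still correctly classified without it."""
    blamed = sorted((c for c in case_base if liability.get(c)),
                    key=lambda c: len(liability[c]), reverse=True)
    for c in blamed:
        reduced = case_base - {c}
        if all(still_correct(covered, reduced)
               for covered in coverage.get(c, ())):
            case_base = reduced  # c causes misclassifications: drop it
    return case_base
```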
Results - % Error
[Bar chart: % error on Datasets 1 and 3, full vs edited case bases. Feature-Based values include 9.8%, 7.0%, 5.7%, 3.8%, 2.4%, 2.2%, 2.6% and 2.2%; Feature-Free values include 1.4%, 0.8%, 1.0% and 0.4%.]

Editing also shrinks the case bases: feature-based edited size = 75% and 65% of the original; feature-free edited size = 59% and 57%.
Concept Drift
The target concept is not static:
- it changes according to season
- it changes according to world events
- people's interests and tolerances change
- there is an arms race: ever more devious spamouflage!
Experiments III
Took ~10,000 emails from two years of email from two people, in date order. Created a case base for each person from the earliest 500 consecutive ham and the earliest 500 consecutive spam. The remaining ~9,000 emails were presented chronologically as test cases. Same settings and measures as before: k = 3; feature-based: 700 features; feature-free: GZip as text compressor.
Retention policies
CBR (and other lazy learners) can easily incorporate the most recent examples:
- retain-all: store all new emails in the case base
- retain-misclassifieds: store a new email only if our prediction is wrong
Results - % Error
[Bar chart: % error on Datasets A and B. Feature-Free (GZip) without retention: 15.9%, 12.6%; with retention, values include 2.3%, 3.2%, 1.5% and 0.7%.]
Retention
But retention has costs:
- a bigger case base reduces efficiency
- obsolete cases may reduce accuracy
- obsolete features may reduce accuracy
So we need a deletion policy.
Incremental Solutions

Consider add-1-delete-1, where the case base size remains constant.

Retention policy:
- retain-all
- retain-misclassifieds

Forgetting policy:
- forget-oldest
- forget-least-accurate, where a case's accuracy = #successes / #retrievals
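One add-1-delete-1 combination, retain-misclassifieds with forget-oldest, can be sketched in a few lines; `classify` stands in for the k-NN classifier and is an assumption of this sketch:

```python
from collections import deque

def update(case_base, email, true_label, classify):
    """add-1-delete-1: only store the email if the prediction was
    wrong, keeping the case base size constant by forgetting the
    oldest case."""
    if classify(case_base, email) != true_label:
        case_base.popleft()                    # forget-oldest
        case_base.append((email, true_label))  # retain-misclassifieds
    return case_base

cb = deque([("e1", "ham"), ("e2", "spam")])
always_ham = lambda case_base, email: "ham"  # stand-in classifier
update(cb, "e3", "spam", always_ham)         # misclassified: retained
print(list(cb))
```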
Results - % Error
[Bar chart: % error on Datasets A and B for Feature-Free variants. Feature-Free with no forgetting: 15.9%, 12.6%. retain-misclassifieds + forget-oldest: 3.2%, 2.3%. retain-all + forget-oldest: 2.8%, 1.7%. retain-misclassifieds + forget-least-accurate: 4.0%, 1.8%. retain-all + forget-least-accurate: 3.0%, 1.9%. A further value of 0.7% appears in the chart.]
Periodic Solutions
Maintenance can instead be periodic:
- Feature-based: retain-misclassifieds; monthly feature re-extraction, feature re-selection, case base rebuild and case base edit
- Feature-free: retain-misclassifieds; monthly case base edit
Results - % Error
[Bar chart: % error on Datasets A and B, comparing Feature-Free and Feature-Based, each with and without retain-misclassifieds plus monthly maintenance (case base edit, or re-select & edit). Values shown: 19.2%, 15.4%, 15.9%, 12.6%, 4.5%, 6.1%, 2.3%, 2.6%, 2.0%, 2.4%, 4.0%, 0.7%, 0.9%, 2.5%.]
Conclusions:
- Costs: retrieval with the feature-free measure is relatively expensive (see Future Work on speeding up retrieval)
- Concept drift: the feature-free approach, with retention and maintenance, handles it well
- The compression-based distance is not a metric
Future Work
- Investigating algorithms to speed up retrieval time
- Application of the measure to text other than emails
Spare slides
[Li et al 2003]
Results - % Error
[Bar chart: % error on Datasets 1 to 4 for GZip, PPM(2), PPM(4) and PPM(8). Values shown: 2.4%, 2.4%, 1.4%, 2.3%, 1.9%, 1.1%, 0.2%, 0.1%, 0.2%, 0.2%.]
Results
There is little difference in classification error: the choice of compressor does not greatly matter.
GZip Speed Up
GZip uses a 32 KByte sliding window, so redundancy between docX and docY is only exploited while both lie within the window.

[Diagram: 32 KB window sliding across docX followed by docY]

Truncating each email to 16 KB achieves speed-ups of between 9.5% and 25%.
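A sketch of the truncation idea, assuming emails arrive as bytes: capping each document at 16 KB keeps the concatenation within gzip's 32 KB window, so cross-document matches remain visible to the compressor while less data is compressed:

```python
import gzip

LIMIT = 16 * 1024  # half of gzip's 32 KB sliding window

def truncated_joint_size(doc_x: bytes, doc_y: bytes) -> int:
    """C(xy) with each document truncated to 16 KB first."""
    return len(gzip.compress(doc_x[:LIMIT] + doc_y[:LIMIT]))
```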