
NAIVE BAYES TEXT CLASSIFICATION

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze,
Introduction to Information Retrieval, Cambridge University Press, 2008.
Chapter 13

Wei Wei
wwei@idi.ntnu.no

TDT4215
Naive Bayes Text Classification
Lecture series
OUTLINE

Introduction: motivation and methods
The Text Classification Problem
Naive Bayes Text Classification
Properties of Naive Bayes
Feature Selection
Evaluation of Text Classification
INTRODUCTION

Motivation for Text Classification
INTRODUCTION

How could they do this?
Hiring web editors is possible for a small quantity of news,
but impossible for large-scale online news.
The only way is automatic classification by machines.
INTRODUCTION

Spam or Not?
Text Classification
INTRODUCTION

Methods for Text Classification
INTRODUCTION

Manual classification
originally used by Yahoo!
very accurate when done by experts
consistent for small-size problems
difficult and expensive to scale
INTRODUCTION

Automatic classification
Hand-coded rule-based systems
o complex query languages
o assign a category if a document contains a given Boolean
combination of words
o accuracy is usually very high if a rule has been carefully
refined over time by a subject expert
o building and maintaining these rules is expensive
INTRODUCTION

Automatic classification
utilizing machine learning techniques
o k-Nearest Neighbors (kNN)
o Naive Bayes (NB)
o Support Vector Machines (SVM)
o some other similar methods
o requires hand-classified training data
Note that many commercial systems use a mixture of methods.
THE TEXT CLASSIFICATION PROBLEM

Text Classification, also known as Text Categorization:
Given a set of classes, we seek to determine which class(es) a
given document belongs to.

An example:
Document with only a sentence:
"London is planning to organize the 2012 Olympics."
We have six classes:
<UK>, <China>, <car>, <coffee>, <elections>, <sports>
Determined: <UK>
THE TEXT CLASSIFICATION PROBLEM

An example:
Document with only a sentence:
"London is planning to organize the 2012 Olympics."
We have six classes:
<UK>, <China>, <car>, <coffee>, <elections>, <sports>
Determined: <UK> and <sports>

For some documents:
there exists more than one class they belong to
referred to as the any-of classification problem
However,
we only consider the one-of classification problem:
a document is a member of exactly one class.
THE TEXT CLASSIFICATION PROBLEM

A formal definition:
Given:
A description of an instance, x ∈ X, where X is the instance
language or instance space.
A fixed set of classes: C = {c_1, c_2, ..., c_J}
Determine:
The category of x: γ(x) ∈ C, where γ(x) is a classification
function, γ: X → C.
NAIVE BAYES TEXT CLASSIFICATION

There are two different ways to set up an NB classifier:
multinomial Naive Bayes (multinomial NB model)
multivariate Bernoulli model (Bernoulli model)
NAIVE BAYES TEXT CLASSIFICATION

Multinomial Naive Bayes
The probability of a document d being in class c is computed as:

P(c|d) ∝ P(c) · ∏_{1 ≤ k ≤ n_d} P(t_k|c)

P(c): the prior probability of a document occurring in class c
P(t_k|c): the conditional probability of term t_k occurring in a
document of class c
<t_1, t_2, ..., t_{n_d}>: the tokens in document d that are part
of the vocabulary used for classification
n_d: the number of such tokens in d
NAIVE BAYES TEXT CLASSIFICATION

Multinomial Naive Bayes
The probability of a document d being in class c is computed as:

P(c|d) ∝ P(c) · ∏_{1 ≤ k ≤ n_d} P(t_k|c)

How to decide the best class in NB classification?
The maximum a posteriori (MAP) class:

c_map = argmax_{c ∈ C} P̂(c|d) = argmax_{c ∈ C} P̂(c) · ∏_{1 ≤ k ≤ n_d} P̂(t_k|c)

Note: we do not know the parameters' true values, but estimate
them from training data (hence the hats on P).
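The MAP rule above can be sketched in a few lines of code. Note that the class names and probability tables below are made-up toy values, not data from the lecture:

```python
# Sketch of the MAP decision rule in its product form.
# priors and cond are assumed toy parameters for illustration.
priors = {"uk": 0.5, "china": 0.5}                     # P^(c)
cond = {"uk":    {"london": 0.10, "olympics": 0.02},   # P^(t_k|c)
        "china": {"london": 0.01, "olympics": 0.02}}

def c_map(tokens):
    """argmax_c  P^(c) * prod_k P^(t_k|c)"""
    scores = {c: priors[c] for c in priors}
    for c in priors:
        for t in tokens:
            scores[c] *= cond[c][t]
    return max(scores, key=scores.get)

print(c_map(["london", "olympics"]))  # uk
```

Here "uk" wins because 0.5 · 0.10 · 0.02 = 0.001 exceeds 0.5 · 0.01 · 0.02 = 0.0001.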
NAIVE BAYES TEXT CLASSIFICATION

How to decide the best class in NB classification?

c_map = argmax_{c ∈ C} P̂(c|d) = argmax_{c ∈ C} P̂(c) · ∏_{1 ≤ k ≤ n_d} P̂(t_k|c)

Many conditional probabilities are multiplied, which can result in
a floating point underflow. Since log(xy) = log(x) + log(y), it is
better to add logarithms instead:

c_map = argmax_{c ∈ C} [ log P̂(c) + ∑_{1 ≤ k ≤ n_d} log P̂(t_k|c) ]

log P̂(c): the relative frequency of c
log P̂(t_k|c): how good an indicator t_k is for c
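The log-space rule can be sketched as follows; the spam/ham classes and the probability tables are hypothetical stand-ins, not the lecture's data:

```python
import math

# Sketch of the log-space MAP rule: summing log P^(t_k|c) instead of
# multiplying P^(t_k|c) avoids floating-point underflow on long
# documents. The toy parameter tables below are assumptions.
def c_map_log(tokens, priors, cond):
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)                 # log P^(c)
        for t in tokens:
            score += math.log(cond[c][t])       # + sum_k log P^(t_k|c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

priors = {"spam": 0.4, "ham": 0.6}
cond = {"spam": {"free": 0.05,  "offer": 0.04,  "meeting": 0.001},
        "ham":  {"free": 0.005, "offer": 0.004, "meeting": 0.03}}
print(c_map_log(["free", "offer", "offer"], priors, cond))  # spam
```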
NAIVE BAYES TEXT CLASSIFICATION

c_map = argmax_{c ∈ C} [ log P̂(c) + ∑_{1 ≤ k ≤ n_d} log P̂(t_k|c) ]

How to estimate the parameters?
Maximum Likelihood Estimation (MLE) for the parameters.
NAIVE BAYES TEXT CLASSIFICATION

What is the Maximum Likelihood Estimation (MLE)?
The relative frequency, which corresponds to the most likely value
of each parameter given the training data.

How?
For the priors:

P̂(c) = N_c / N

N_c: the number of documents in class c
N: the total number of documents

For the conditional probability:

P̂(t|c) = T_ct / ∑_{t' ∈ V} T_ct'

T_ct: the number of occurrences of t in training documents from
class c
NAIVE BAYES TEXT CLASSIFICATION

A problem with MLE:
What if a term t did not occur in the training data of class c?

P̂(t|c) = T_ct / ∑_{t' ∈ V} T_ct' = 0

but for log P̂(t|c) we need P̂(t|c) > 0.

Solution: add-one or Laplace smoothing

P̂(t|c) = (T_ct + 1) / ∑_{t' ∈ V} (T_ct' + 1) = (T_ct + 1) / (∑_{t' ∈ V} T_ct' + B)

where B = |V| is the number of terms in the vocabulary.
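The two estimation formulas can be sketched as a small training routine. The training set below is the textbook's China example (three documents in class c, one in the other class); `train_nb` and the class label names are illustrative choices:

```python
from collections import Counter

# Sketch of MLE training with add-one (Laplace) smoothing.
# docs is a list of (token_list, class_label) pairs.
def train_nb(docs):
    vocab = {t for tokens, _ in docs for t in tokens}
    B = len(vocab)                                     # B = |V|
    N = len(docs)
    prior, cond = {}, {}
    for c in {label for _, label in docs}:
        class_docs = [tokens for tokens, label in docs if label == c]
        prior[c] = len(class_docs) / N                 # P^(c) = N_c / N
        tf = Counter(t for tokens in class_docs for t in tokens)
        total = sum(tf.values())                       # sum_t' T_ct'
        # P^(t|c) = (T_ct + 1) / (sum_t' T_ct' + B)
        cond[c] = {t: (tf[t] + 1) / (total + B) for t in vocab}
    return prior, cond, vocab

prior, cond, vocab = train_nb([
    ("chinese chinese beijing".split(),  "c"),
    ("chinese chinese shanghai".split(), "c"),
    ("chinese macao".split(),            "c"),
    ("tokyo japan chinese".split(),      "not_c"),
])
print(prior["c"], cond["c"]["chinese"])  # 0.75 and 3/7 = 0.4285...
```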
NAIVE BAYES TEXT CLASSIFICATION

Naive Bayes algorithm: Training
NAIVE BAYES TEXT CLASSIFICATION

Naive Bayes algorithm: Testing
NAIVE BAYES TEXT CLASSIFICATION

Question:
Decide whether document d5 belongs to class c = China.
NAIVE BAYES TEXT CLASSIFICATION

Solution:
Training:
P̂(c) = 3/4, P̂(c̄) = 1/4
P̂(Chinese|c) = (5+1)/(8+6) = 6/14 = 3/7
P̂(Tokyo|c) = P̂(Japan|c) = (0+1)/(8+6) = 1/14
P̂(Chinese|c̄) = (1+1)/(3+6) = 2/9
P̂(Tokyo|c̄) = P̂(Japan|c̄) = (1+1)/(3+6) = 2/9

Testing:
P̂(c|d5) ∝ 3/4 · (3/7)³ · 1/14 · 1/14 ≈ 0.0003
P̂(c̄|d5) ∝ 1/4 · (2/9)³ · 2/9 · 2/9 ≈ 0.0001

Result: c = China
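The worked example's arithmetic can be checked directly, plugging in the slide's estimates for the test document d5 = "Chinese Chinese Chinese Tokyo Japan":

```python
# Arithmetic check of the multinomial worked example, using the
# slide's smoothed training estimates.
p_c, p_not = 3/4, 1/4
p_chinese_c, p_tokyo_c, p_japan_c = 3/7, 1/14, 1/14
p_chinese_n, p_tokyo_n, p_japan_n = 2/9, 2/9, 2/9

score_c   = p_c   * p_chinese_c**3 * p_tokyo_c * p_japan_c
score_not = p_not * p_chinese_n**3 * p_tokyo_n * p_japan_n
print(round(score_c, 4), round(score_not, 4))  # 0.0003 0.0001
```

Since 0.0003 > 0.0001, the multinomial classifier assigns d5 to China.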
NAIVE BAYES TEXT CLASSIFICATION

There are two different ways to set up an NB classifier:
multinomial Naive Bayes (multinomial NB model)
multivariate Bernoulli model (Bernoulli model)
NAIVE BAYES TEXT CLASSIFICATION

Bernoulli model
Differs from the multinomial NB model in:
different estimation strategies
different classification rules
NAIVE BAYES TEXT CLASSIFICATION

Training of the prior probability is the same.
For the conditional probability:
multinomial model: P̂(t|c) is the fraction of tokens in documents
of class c that are t
Bernoulli model: P̂(t|c) is the fraction of documents of class c
that contain t
NAIVE BAYES TEXT CLASSIFICATION

Multinomial model: only terms that appear in the document are
considered in classification.
Bernoulli model: non-occurring terms still affect the computation.
NAIVE BAYES TEXT CLASSIFICATION

Question with the Bernoulli model:
Decide whether document d5 belongs to class c = China.
NAIVE BAYES TEXT CLASSIFICATION

Solution with the Bernoulli model:
Training:
P̂(c) = 3/4, P̂(c̄) = 1/4
P̂(Chinese|c) = (3+1)/(3+2) = 4/5
P̂(Tokyo|c) = P̂(Japan|c) = (0+1)/(3+2) = 1/5
P̂(Beijing|c) = P̂(Macao|c) = P̂(Shanghai|c) = (1+1)/(3+2) = 2/5
P̂(Chinese|c̄) = (1+1)/(1+2) = 2/3
P̂(Tokyo|c̄) = P̂(Japan|c̄) = (1+1)/(1+2) = 2/3
P̂(Beijing|c̄) = P̂(Macao|c̄) = P̂(Shanghai|c̄) = (0+1)/(1+2) = 1/3

Testing:
P̂(c|d5) ∝ 3/4 · 4/5 · 1/5 · 1/5 · (1 - 2/5)³ ≈ 0.005
P̂(c̄|d5) ∝ 1/4 · 2/3 · 2/3 · 2/3 · (1 - 1/3)³ ≈ 0.022

Result: not China
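The Bernoulli score can be sketched as a small function: occurring terms contribute P̂(t|c), and every non-occurring vocabulary term contributes 1 - P̂(t|c). The helper name `bernoulli_score` is an illustrative choice; the numbers are the slide's estimates:

```python
# Sketch of the Bernoulli document score over the whole vocabulary.
def bernoulli_score(doc_terms, prior, p):  # p: term -> P^(t|c)
    score = prior
    for t, pt in p.items():
        score *= pt if t in doc_terms else (1 - pt)
    return score

p_c   = {"chinese": 4/5, "japan": 1/5, "tokyo": 1/5,
         "beijing": 2/5, "macao": 2/5, "shanghai": 2/5}
p_not = {"chinese": 2/3, "japan": 2/3, "tokyo": 2/3,
         "beijing": 1/3, "macao": 1/3, "shanghai": 1/3}
d5 = {"chinese", "japan", "tokyo"}
print(round(bernoulli_score(d5, 3/4, p_c), 3))    # 0.005
print(round(bernoulli_score(d5, 1/4, p_not), 3))  # 0.022
```

Unlike the multinomial model, the Bernoulli model penalizes the China class for the absence of Beijing, Macao, and Shanghai, which flips the decision.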
PROPERTIES OF NAIVE BAYES

Recall Bayes' rule:

P(AB) = P(A) P(B|A) = P(B) P(A|B)

P(B|A) = P(B) P(A|B) / P(A)
PROPERTIES OF NAIVE BAYES

With Bayes' rule, for a document d and a class c:

P(c|d) = P(d|c) P(c) / P(d)

c_map = argmax_{c ∈ C} P(c|d)
      = argmax_{c ∈ C} P(d|c) P(c) / P(d)     (Bayes' rule)
      = argmax_{c ∈ C} P(d|c) P(c)            (P(d) does not affect the result)
PROPERTIES OF NAIVE BAYES

c_map = argmax_{c ∈ C} P(d|c) P(c)

How to compute P(d|c)? Computing both conditional probabilities
directly has high time complexity.

Multinomial:
P(d|c) = P(<t_1, ..., t_k, ..., t_{n_d}>|c)
where <t_1, ..., t_k, ..., t_{n_d}> is the sequence of terms as it
occurs in d.

Bernoulli:
P(d|c) = P(<e_1, ..., e_k, ..., e_M>|c)
where <e_1, ..., e_k, ..., e_M> is a binary vector of dimensionality
M that indicates for each term whether it occurs in d or not.
PROPERTIES OF NAIVE BAYES

Conditional Independence Assumption

Multinomial:
P(d|c) = P(<t_1, ..., t_k, ..., t_{n_d}>|c) = ∏_{1 ≤ k ≤ n_d} P(X_k = t_k|c)
P(X_k = t_k|c): the probability that in a document of class c the
term t_k will occur in position k.

Bernoulli:
P(d|c) = P(<e_1, ..., e_i, ..., e_M>|c) = ∏_{1 ≤ i ≤ M} P(U_i = e_i|c)
P(U_i = e_i|c): the probability that in a document of class c the
term t_i will occur if e_i = 1, and will not occur if e_i = 0.
PROPERTIES OF NAIVE BAYES

Multinomial:
P(d|c) = P(<t_1, ..., t_k, ..., t_{n_d}>|c) = ∏_{1 ≤ k ≤ n_d} P(X_k = t_k|c)
P(X_k = t_k|c): the probability that in a document of class c the
term t_k will occur in position k.

There is still high time complexity if we have to consider the
position in which each term t occurs.

Positional Independence Assumption:
P(X_{k1} = t|c) = P(X_{k2} = t|c)

This is equivalent to the bag of words model.
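Under positional independence only term counts matter, which is exactly what a bag-of-words representation captures. A minimal sketch (the example sentences are made up):

```python
from collections import Counter

# Two word orders yield identical bag-of-words representations, so a
# multinomial NB classifier assigns them identical scores.
d1 = "london olympics london 2012".split()
d2 = "2012 london london olympics".split()
print(Counter(d1) == Counter(d2))  # True
```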
FEATURE SELECTION

Feature Selection is a process of selecting a subset of the terms
occurring in the training set and using only this subset as
features in text classification.

Feature Selection: Why?
Text collections have a large number of features
o 10,000 - 1,000,000 unique words and more
May make using a particular classifier feasible
o Some classifiers can't deal with 100,000s of features
Reduces training time
o Training time for some methods is quadratic or worse in the
number of features
Can improve generalization
o Eliminates noise features and avoids overfitting
FEATURE SELECTION

Feature Selection: How?
A(t,c) utility measures:
frequency
mutual information
the χ² test
FEATURE SELECTION

Frequency-based feature selection
selecting the terms that are most common in the class
simple and easy to implement
may select some frequent terms that carry no class-specific
information (such as Monday, Tuesday, ...)
however, if many thousands of features are selected, it usually
does well.
FEATURE SELECTION

Mutual Information feature selection: A(t,c) = I(U;C)

U is a random variable:
o e_t = 1: the document contains t
o e_t = 0: the document does not contain t
C is a random variable:
o e_c = 1: the document is in class c
o e_c = 0: the document is not in class c
FEATURE SELECTION

With Maximum Likelihood Estimation, the mutual information is
computed from document counts:

I(U;C) = N11/N · log2(N·N11 / (N1.·N.1)) + N01/N · log2(N·N01 / (N0.·N.1))
       + N10/N · log2(N·N10 / (N1.·N.0)) + N00/N · log2(N·N00 / (N0.·N.0))

N11: number of documents that contain t and are in c
N10: number of documents that contain t, but are NOT in c
N01: number of documents that do NOT contain t, but are in c
N00: number of documents that do NOT contain t and are NOT in c

N1. = N10 + N11: number of documents that contain t
N.1 = N01 + N11: number of documents in c
N0. = N00 + N01: number of documents that do NOT contain t
N.0 = N00 + N10: number of documents NOT in c
N = N00 + N01 + N10 + N11: total number of documents
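The MLE estimate of I(U;C) can be sketched from the four counts; the function name and the toy count values are illustrative, and 0·log 0 is taken to be 0 as usual:

```python
import math

# Sketch of the MLE mutual information from the four document counts.
def mutual_information(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # contain t / do not contain t
    n_1, n_0 = n11 + n01, n10 + n00   # in c / not in c
    def term(nij, row, col):
        if nij == 0:
            return 0.0                # 0 * log 0 := 0
        return (nij / n) * math.log2(n * nij / (row * col))
    return (term(n11, n1_, n_1) + term(n10, n1_, n_0) +
            term(n01, n0_, n_1) + term(n00, n0_, n_0))

print(mutual_information(10, 10, 10, 10))  # 0.0  (t and c independent)
print(mutual_information(20, 0, 0, 20))    # 1.0  (perfectly dependent)
```

An independent term contributes no information about the class; a perfectly class-predicting term carries one full bit.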
FEATURE SELECTION

An example:
In Reuters-RCV1, c = poultry, t = export.
FEATURE SELECTION

The figure shows terms with high mutual information scores for
the six classes in Reuters-RCV1.
FEATURE SELECTION

The χ² feature selection
In statistics, the χ² test is applied to test the independence of
two events. Events A and B are defined to be independent if
P(AB) = P(A)P(B), or equivalently
P(A|B) = P(A) and P(B|A) = P(B).
In feature selection, the two events are the occurrence of the
term and the occurrence of the class.
FEATURE SELECTION

The χ² feature selection

N_{e_t e_c} has the same meaning as in Mutual Information feature
selection.
E_{e_t e_c} is the expected frequency of t and c occurring together
in a document, assuming that term and class are independent.
FEATURE SELECTION

The χ² feature selection

χ²(t,c) = ∑_{e_t ∈ {0,1}} ∑_{e_c ∈ {0,1}} (N_{e_t e_c} - E_{e_t e_c})² / E_{e_t e_c}

N, N00, N01, N10, N11 can be counted from the training data set,
as in Mutual Information feature selection.
E_{e_t e_c} (i.e. E00, E01, E10, E11) can also be computed from
the training data set.
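The statistic can be sketched from the same four counts, with each expected cell frequency E computed as N times the product of the marginal probabilities; the function name and toy counts are illustrative:

```python
# Sketch of the chi-square statistic: E_ij = N * (row/N) * (col/N),
# then X^2 = sum over the four cells of (N_ij - E_ij)^2 / E_ij.
def chi_square(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    rows = {1: n11 + n10, 0: n01 + n00}   # contain t / do not
    cols = {1: n11 + n01, 0: n10 + n00}   # in c / not in c
    obs = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    x2 = 0.0
    for (r, c), nij in obs.items():
        e = n * (rows[r] / n) * (cols[c] / n)  # expected under independence
        x2 += (nij - e) ** 2 / e
    return x2

print(chi_square(10, 10, 10, 10))  # 0.0 under exact independence
print(chi_square(20, 0, 0, 20))    # 40.0 for a perfectly dependent term
```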
FEATURE SELECTION

The example again:
Compute E11; the other E_{e_t e_c} are computed in the same way.

The higher the χ² value, the more dependence between term t and
class c.
EVALUATION OF TEXT CLASSIFICATION

Evaluation must be done on test data that are independent of the
training data (usually a disjoint set of instances).

Classification accuracy: c/n
n is the total number of test instances
c is the number of test instances correctly classified

Accuracy measurement is appropriate only if the percentage of
documents in the class is high.
For a class with relative frequency 1%, the "always no" classifier
will achieve 99% accuracy.
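The 99% pitfall is easy to demonstrate on made-up labels:

```python
# On a class with 1% relative frequency, a classifier that always
# answers "no" reaches 99% accuracy, which is why plain accuracy can
# be misleading for rare classes. Labels below are assumed toy data.
labels = ["yes"] * 1 + ["no"] * 99      # 1% positive class
predictions = ["no"] * 100              # the "always no" classifier
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(accuracy)  # 0.99
```

This is why precision, recall, and F1 are usually reported alongside accuracy for rare classes.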
SUMMARY

Introduction: motivation and methods
The Text Classification Problem
Naive Bayes Text Classification
Properties of Naive Bayes
Feature Selection
Evaluation of Text Classification