
(2013)
Email: yangjw@pku.edu.cn

Examples of category systems:
  AMS subject classification
  Web catalogs: Yahoo!, Open Directory (http://www.dmoz.org/)

Classification settings:
  binary
  multi-class
  multi-label


MEDLINE (National Library of Medicine)
  $2 million/year for manual indexing of journal articles
  (40%)
  (85%)

Text Categorization (TC)

A classifier is a mapping f : A → B, where A is the set of documents and B is the set of predefined categories.

Automated text categorization research dates from the late 1980s.

Example: the 1990 US Census
  ~22 million returns to be coded into 232 industry categories and 504 occupation categories
  ~$15 million if fully done by hand
  Hand-built expert system AIOCS
    Development time: 192 person-months (2 people, 8 years)
    Accuracy = 47%
  Learned memory-based classifier (Creecy '92: 1-NN)
    Development time: 4 person-months; Accuracy = 60%





Given: a collection of example news stories already labeled with a category (topic).
Task: predict the category for news stories not yet labeled.
For our example, we'll only get to see the headline of the news story.
We'll represent categories using colors. (All examples with the same color belong to the same category.)


Example news-story headlines (the labeled training examples):
  Amatil Proposes Two-for-Five Bonus Share Issue
  Citibank Norway Unit Loses Six Mln Crowns in 1986
  Anheuser-Busch Joins Bid for San Miguel
  Italy's La Fondiaria to Report Higher 1986 Profits
  Japan Ministry Says Open Farm Trade Would Hit U.S.
  Isuzu Plans No Interim Dividend
  Vieille Montagne Says 1986 Conditions Unfavourable
  Senator Defends U.S. Mandatory Farm Control Bill
  Jardine Matheson Said It Sets Two-for-Five Bonus Issue Replacing B Shares
  Bowater Industries Profit Exceed Expectations


A new headline to classify (not yet labeled):
  Senate Panel Studies Loan Rate, Set Aside Plans
(shown together with the labeled training headlines above)


Evaluation measures
  P: precision
  R: recall
  F-measure: the harmonic mean of precision and recall,
    1/F1 = (1/2) (1/P + 1/R),  i.e.  F1 = 2PR / (P + R)

Per-category contingency counts: a = system yes / truth yes, b = system yes / truth no, c = system no / truth yes, d = system no / truth no.
  precision = a/(a+b)
  recall = a/(a+c),  miss rate = 1 - recall
  accuracy = (a+d)/(a+b+c+d),  error = (b+c)/(a+b+c+d) = 1 - accuracy
  fallout = b/(b+d) = false alarm rate
  F_β = (β² + 1) p r / (β² p + r)
Other summary figures:
  Break-Even Point (BEP): the point where p = r
  interpolated 11-point average precision (on the p-r curve)

Category-averaged F measures
  Macro-averaging: compute the measure per category, then average over the m categories
    F1_i = 2 P_i R_i / (P_i + R_i)
    MacroF1 = (1/m) Σ_{i=1..m} F1_i
  Micro-averaging: weight each category by its size n_i
    MicroF1 = Σ_{i=1..m} (n_i F1_i) / Σ_{i=1..m} n_i
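To make the preceding measures concrete, here is a minimal Python sketch (single-label setting; the label lists and category names are made up for illustration). Note that it uses the pooled-counts form of micro-averaging, whereas the slide's MicroF1 formula weights each per-category F1 by the category size n_i.

def prf(tp, fp, fn):
    # precision, recall, F1 from contingency counts (a = tp, b = fp, c = fn)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_micro_f1(y_true, y_pred, categories):
    tables = {c: [0, 0, 0] for c in categories}   # per-category [tp, fp, fn]
    for t, p in zip(y_true, y_pred):
        if t == p:
            tables[t][0] += 1          # true positive for the true category
        else:
            tables[p][1] += 1          # false positive for the predicted category
            tables[t][2] += 1          # false negative for the true category
    macro = sum(prf(*tables[c])[2] for c in categories) / len(categories)
    tp = sum(v[0] for v in tables.values())
    fp = sum(v[1] for v in tables.values())
    fn = sum(v[2] for v in tables.values())
    micro = prf(tp, fp, fn)[2]         # micro: pool the counts, then compute F1
    return macro, micro

y_true = ["earn", "grain", "earn", "acq", "grain"]
y_pred = ["earn", "earn", "earn", "acq", "grain"]
print(macro_micro_f1(y_true, y_pred, ["earn", "grain", "acq"]))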


Document processing (feature extraction):
  strip HTML tags
  remove stop words; stemming
  collect term statistics (TF, DF)
  feature selection
  re-parameterisation (e.g., LSA)

Vector Space Model
  Choose M index terms t_i (characters / words / phrases / concepts).
  Each document d_j is represented as a vector (a_1j, a_2j, ..., a_Mj).
  N documents form the term-document matrix A_{M×N} = (a_ij).
  Similarity between vectors is measured by the cosine of the angle between them.

  Example in a three-term space T1, T2, T3:
    D1 = 2·T1 + 3·T2 + 5·T3
    D2 = 3·T1 + 7·T2 + 1·T3
    Q  = 0·T1 + 0·T2 + 2·T3
  [figure: the three vectors plotted on the T1, T2, T3 axes]
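A worked check of the cosine measure on these toy vectors:
  cos(D1, Q) = (2·0 + 3·0 + 5·2) / ( sqrt(2² + 3² + 5²) · sqrt(0² + 0² + 2²) ) = 10 / (2·sqrt(38)) ≈ 0.81
  cos(D2, Q) = (3·0 + 7·0 + 1·2) / ( sqrt(3² + 7² + 1²) · sqrt(2²) ) = 2 / (2·sqrt(59)) ≈ 0.13
so Q is far closer in direction to D1 than to D2.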

Choice of indexing term
  character
  word
  phrase
  concept
  word cluster / term cluster
  N-gram (N consecutive units within a window)
  David Lewis's experiments suggest that plain words already work well as indexing terms.

Term weighting schemes
  Boolean weighting: a_ij = 1 if TF_ij > 0, otherwise 0
  TF (term frequency): a_ij = TF_ij
    (a word is more important if it appears several times in the target document)
  TF·IDF: a_ij = TF_ij · log(N / DF_i)
  TFC (cosine-normalized TF·IDF):
    a_ij = TF_ij · log(N / DF_i) / sqrt( Σ_k [ TF_kj · log(N / DF_k) ]² )
  LTC (log-TF variant):
    a_ij = log(TF_ij + 1.0) · log(N / DF_i) / sqrt( Σ_k [ log(TF_kj + 1.0) · log(N / DF_k) ]² )
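A minimal sketch of TF·IDF weighting plus cosine similarity in Python (the token lists are made-up examples; a real system would tokenize, remove stop words and stem first):

import math
from collections import Counter

def tfidf_vectors(docs):
    # a_ij = TF_ij * log(N / DF_i), computed for a list of tokenized documents
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))    # document frequency of each term
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    # cosine similarity of two sparse vectors stored as {term: weight} dicts
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["bonus", "share", "issue"], ["farm", "trade", "bill"], ["bonus", "issue", "shares"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[2]))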

Why reduce the number of features?
  The over-fitting problem: with too many features a classifier fits accidental regularities of the training data and generalizes poorly.
  A simple first step is removing stop words.


Feature selection measures:
  DF, document frequency
  information gain
  mutual information
  the χ² test (chi-square, pronounced [kai])

Feature selection — DF (document frequency) thresholding
  Compute the DF of each term over the training corpus and remove terms whose DF is below a threshold.
  Assumption: rare terms carry little category information.
  Simple and scales to large corpora, but it is an ad hoc criterion.

Term-category contingency table (matrix), used by the measures below:
  A = documents that contain term t and belong to category c
  B = documents that contain t but do not belong to c
  C = documents in c that do not contain t
  D = documents neither containing t nor in c
  N = A + B + C + D

Feature selection — entropy
  For a distribution p = {p1, p2, ..., pm}:
    Entropy(p) = - Σ_i P_i log P_i,  with P_i = c_i / c
  (the probabilities are estimated from counts over the n = A + B + C + D documents)

Feature selection — entropy of a term
  For term t, estimate P(c_i | t) for each category c_i from the contingency counts (A / (A + C) in the slide's notation).
  Entropy(t) = - Σ_i P(c_i | t) log P(c_i | t)
  A term whose occurrences concentrate in a few categories (low entropy) is a better discriminator than one spread evenly across categories.

Feature selection — Information Gain (IG)
  IG measures how much the presence or absence of term t reduces the entropy of the category distribution:
    Gain(t) = Entropy(S) - Expected Entropy(S_t)
            = { - Σ_{i=1..M} P(c_i) log P(c_i) }
              - [ P(t) { - Σ_{i=1..M} P(c_i | t) log P(c_i | t) }
                + P(~t) { - Σ_{i=1..M} P(c_i | ~t) log P(c_i | ~t) } ]
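A small Python sketch of IG computed directly from labels (the document labels and the term's document set are hypothetical):

import math

def entropy(probs):
    # H = -sum p log2 p, skipping zero probabilities
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(doc_cats, docs_with_term):
    # Gain(t) = H(C) - [ P(t) H(C|t) + P(~t) H(C|~t) ]
    n = len(doc_cats)
    cats = sorted(set(doc_cats))
    def cat_dist(indices):
        return [sum(1 for i in indices if doc_cats[i] == c) / len(indices) for c in cats] if indices else []
    with_t = [i for i in range(n) if i in docs_with_term]
    without_t = [i for i in range(n) if i not in docs_with_term]
    return entropy(cat_dist(list(range(n)))) - (
        len(with_t) / n * entropy(cat_dist(with_t))
        + len(without_t) / n * entropy(cat_dist(without_t)))

# term occurring in documents 0, 1, 2 of a 6-document collection
print(information_gain(["earn", "earn", "earn", "grain", "grain", "acq"], {0, 1, 2}))  # 1.0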

Feature selection — Mutual Information (MI)
  MI measures the dependence between two random variables X and Y:
    MI(X, Y) = Σ_x Σ_y P(x, y) log [ P(x, y) / ( P(x) P(y) ) ]
  where P(x, y) is the joint probability that X = x and Y = y.

Feature selection — MI for terms
  Let X be a term t (e.g., t present or absent) and Y the category, with m values c = {c1, c2, ..., cm}.
  Estimate P(t), P(c) and P(t, c) from the training corpus, then compute the mutual information between t and the categories.

Feature selection — pointwise MI from the contingency table
  For term t and category c, with the table (rows t / ~t, columns c / ~c) and N = A + B + C + D:
    I(t, c) = log [ P(t ∧ c) / ( P(t) P(c) ) ]
            = log [ (A/N) / ( ((A+B)/N) · ((A+C)/N) ) ]
            = log [ A·N / ( (A+B)(A+C) ) ]
  Combining over categories:
    I_avg(t) = Σ_{i=1..m} P(c_i) I(t, c_i)
    I_max(t) = max_{i=1..m} I(t, c_i)
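The pointwise MI and its category combinations, as a Python sketch (the contingency counts and priors are made up):

import math

def pointwise_mi(a, b, c, d):
    # I(t, c) = log [ A*N / ((A+B)(A+C)) ]
    n = a + b + c + d
    return math.log(a * n / ((a + b) * (a + c))) if a > 0 else float("-inf")

def mi_avg(tables, priors):
    # tables: one (A, B, C, D) tuple per category; priors: P(c_i)
    return sum(p * pointwise_mi(*t) for p, t in zip(priors, tables))

def mi_max(tables):
    return max(pointwise_mi(*t) for t in tables)

tables = [(40, 10, 60, 890), (5, 45, 495, 455)]   # one term vs. two categories
print(mi_avg(tables, [0.1, 0.5]), mi_max(tables))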

Feature selection — a caveat about MI
  MI(t, c) measures the association between t and c, but the score depends strongly on P(t): for terms with the same P(t | c), rarer terms get higher MI values, so scores of terms with very different frequencies are not directly comparable.

Feature selection — the χ² (chi-square) statistic
  χ²(t, c) measures the lack of independence between term t and category c.

Feature selection — computing χ² for a term
  Contingency table (N = A + B + C + D):

           c    ~c
     t     A     B
    ~t     C     D

    χ²(t, c) = N (AD - CB)² / [ (A+C)(B+D)(A+B)(C+D) ]

  If AD < CB, the term and the category are negatively correlated.
  Combining over categories:
    χ²_avg(t) = Σ_{i=1..m} P(c_i) χ²(t, c_i)
    χ²_max(t) = max_{i=1..m} χ²(t, c_i)
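The χ² statistic from the same contingency counts, as a Python sketch (the example counts are invented):

def chi_square(a, b, c, d):
    # chi²(t, c) = N (AD - CB)² / [ (A+C)(B+D)(A+B)(C+D) ]
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

print(chi_square(40, 10, 60, 890))   # term strongly associated with the category
print(chi_square(10, 90, 90, 810))   # term independent of the category -> 0.0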


Summary of feature selection measures
  DF: Document Frequency
  IG: Information Gain
    G(t) = - Σ_{i=1..m} Pr(c_i) log Pr(c_i)
           + Pr(t) Σ_{i=1..m} Pr(c_i | t) log Pr(c_i | t)
           + Pr(~t) Σ_{i=1..m} Pr(c_i | ~t) log Pr(c_i | ~t)
  MI: Mutual Information
    I(t, c) = log [ A·N / ( (A+C)(A+B) ) ]
  CHI: χ² statistic
    χ²(t, c) = N (AD - CB)² / [ (A+C)(B+D)(A+B)(C+D) ]

[figure: feature selection results with kNN — curves for CHI.max, IG, DF, TS, and MI.max]

[figure: feature selection results with LLSF — curves for CHI.max, IG, DF, TS, and MI.max]


[table: properties of classifiers (independence assumptions, binary vs. M-ary output) — DTree, NB, NN, NNet, Rocchio, SVM, LLSF, KNN, WORD]


Classification methods
  Decision Trees
  kNN (K-Nearest Neighbour)
  Bayes Network
  Neural Networks
  Boosting
  SVM


Example decision tree:

  age?
    <=30   → student?
               no  → no
               yes → yes
    30..40 → yes
    >40    → credit rating?
               excellent → no
               fair      → yes

Decision tree algorithms: CART, C4.5 (ID3), CHAID
Trees are simplified by pruning.

Attribute Selection Measure: Information Gain (ID3/C4.5)
  S contains s_i tuples of class C_i for i = {1, ..., m}.
  The information required to classify an arbitrary tuple is
    I(s_1, s_2, ..., s_m) = - Σ_{i=1..m} (s_i / s) log2 (s_i / s)

  Entropy of attribute A with values {a_1, a_2, ..., a_v}:
    E(A) = Σ_{j=1..v} [ (s_1j + ... + s_mj) / s ] · I(s_1j, ..., s_mj)
  Information gained by branching on attribute A:
    Gain(A) = I(s_1, s_2, ..., s_m) - E(A)
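A worked Python sketch of these attribute-selection formulas; the 9/5 class counts and the three-way age split follow the usual textbook example and are illustrative only:

import math

def info(counts):
    # I(s1, ..., sm) = -sum (si/s) log2 (si/s)
    s = sum(counts)
    return -sum(si / s * math.log2(si / s) for si in counts if si)

def gain(class_counts, branches):
    # Gain(A) = I(s1, ..., sm) - E(A); branches holds the class counts for each value of A
    s = sum(class_counts)
    e_a = sum(sum(br) / s * info(br) for br in branches)
    return info(class_counts) - e_a

# 9 "yes" / 5 "no" tuples; splitting on age gives (2,3), (4,0), (3,2)
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ≈ 0.246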

Rocchio method
  Build a prototype vector for each category c from the training vectors x_i = (x_i1, ..., x_iM):
    w_jc = (1 / n_C) Σ_{i ∈ C} x_ij  -  (1 / (n - n_C)) Σ_{i ∉ C} x_ij
  (documents of the category pull the prototype toward them; the other documents push it away)
  Score of document d_i for category c:
    CSV_c(d_i) = w_c · x_i,  often cosine-normalized:  Σ_j w_cj x_ij / ( ||w_c|| · ||x_i|| )
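A minimal Rocchio sketch in Python (documents are sparse {term: weight} dicts; the vectors, labels and category names are invented; the prototype formula above is used with unit weights on the positive and negative parts):

def rocchio_prototypes(vectors, labels, categories):
    # w_jc = (1/n_C) sum_{i in C} x_ij - (1/(n - n_C)) sum_{i not in C} x_ij
    protos = {}
    for c in categories:
        pos = [v for v, l in zip(vectors, labels) if l == c]
        neg = [v for v, l in zip(vectors, labels) if l != c]
        w = {}
        for group, sign in ((pos, 1.0), (neg, -1.0)):
            if not group:
                continue
            for v in group:
                for t, x in v.items():
                    w[t] = w.get(t, 0.0) + sign * x / len(group)
        protos[c] = w
    return protos

def classify(protos, x):
    # CSV_c(d) = w_c . x ; return the best-scoring category
    score = lambda w: sum(wt * x.get(t, 0.0) for t, wt in w.items())
    return max(protos, key=lambda c: score(protos[c]))

docs = [{"bonus": 1.0, "share": 1.0}, {"farm": 1.0, "trade": 1.0}, {"share": 1.0, "issue": 1.0}]
protos = rocchio_prototypes(docs, ["finance", "agriculture", "finance"], ["finance", "agriculture"])
print(classify(protos, {"bonus": 1.0, "issue": 1.0}))   # -> finance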

K-NN

[figure: 1-Nearest Neighbor (graphically)]

kNN — lazy, example-based learning
  [figure: the same test point classified with different neighborhoods, e.g. k = 1 gives class A while k = 5 and k = 10 give class B]

kNN — Instance-Based Learning, Lazy Learning
  Well-known approach to pattern recognition
  Initially proposed by Fix and Hodges (1951)
  Theoretical error-bound analysis by Cover & Hart (1967)
  Applied to text categorization in the early 90s (Yang et al.)
  Among the top-performing methods in TC evaluations
  Scalable to large TC applications

kNN for Text Categorization (Yang, SIGIR-1994)
  Represent documents as points (vectors).
  Define a similarity measure for pairwise documents.
  Tune the parameter k to optimize classification effectiveness.
  Choose a voting scheme (e.g., weighted sum) for scoring categories.
  Threshold the scores for classification decisions.

Nearest Neighbor
  Similar item: we need a functional definition of similarity if we want to apply this automatically.
  Does each neighbor get the same weight?

K-NN using a weighted-sum voting scheme

Category Scoring for Weighted-Sum Voting
  The score for a category is the sum of the similarity scores between the point to be classified and all of its k nearest neighbors that belong to the given category:

    score(c | x) = Σ_{d ∈ kNN of x} sim(x, d) · I(d, c)  -  b_c

  where
    x is the new point and c is a class;
    d is a classified point among the k nearest neighbors of x;
    sim(x, d) is the similarity between x and d;
    I(d, c) = 1 iff point d belongs to class c, and 0 otherwise;
    b_c is a category-specific decision threshold.
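A sketch of the weighted-sum voting in Python (train is a list of (vector, category) pairs; sim can be, e.g., the cosine function from the TF·IDF sketch earlier; all names are illustrative):

def knn_scores(x, train, k, sim):
    # score(c | x) = sum over the k nearest neighbors d of sim(x, d) * I(d, c)
    neighbors = sorted(train, key=lambda dc: sim(x, dc[0]), reverse=True)[:k]
    scores = {}
    for d, c in neighbors:
        scores[c] = scores.get(c, 0.0) + sim(x, d)
    return scores   # compare each score with its threshold b_c, or take the arg max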

kNN: the choice of k
  [figures: classification results for different values of k]

kNN: pros and cons
  Effective: among top-5 methods in benchmark evaluations; applicable to large collections such as Web data.
  kNN is lazy learning: there is no training phase, but all training examples must be stored and each classification requires comparing against them, which is expensive at prediction time.

Neural networks: a single neuron
  The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping:
    y = f( Σ_i w_i x_i - μ_k )
  [diagram: input vector x = (x0, x1, ..., xn), weight vector w = (w0, w1, ..., wn), weighted sum with bias -μ_k, activation function f, output y]
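The single neuron above, as a short Python sketch (the weights, bias and the sigmoid activation are arbitrary choices for illustration):

import math

def neuron(x, w, mu):
    # y = f( sum_i w_i * x_i - mu ), here with a sigmoid activation f
    s = sum(wi * xi for wi, xi in zip(w, x)) - mu
    return 1.0 / (1.0 + math.exp(-s))

print(neuron([1.0, 0.0, 2.0], [0.5, -0.3, 0.8], 1.0))   # ≈ 0.75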

Bayesian (naive Bayes) classifier
  P(c_j | d_i) = P(d_i | c_j) P(c_j) / P(d_i)  ∝  P(d_i | c_j) P(c_j)

  Naive independence assumption over the r terms of document d_i:
    P(d_i | c_j) = Π_{k=1..r} P(w_ik | c_j)

  Category prior, from the number of training documents per category:
    P(c_j) = N(c_j) / Σ_k N(c_k)    (smoothed: (1 + N(c_j)) / (|C| + Σ_k N(c_k)))

  Word probability, from the count N_ij of word w_i in documents of category c_j:
    P(w_i | c_j) = (1 + N_ij) / Σ_k N_kj
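A minimal multinomial naive Bayes sketch in Python. It follows the formulas above except that the word probabilities use full add-one (Laplace) smoothing, i.e. the vocabulary size is added to the denominator, and the prior is left unsmoothed; the toy documents and category names are invented:

import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    # estimate log P(c_j) and log P(w | c_j)
    cat_counts = Counter(labels)
    word_counts = defaultdict(Counter)            # N_ij: count of word i in category j
    vocab = set()
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
        vocab.update(d)
    priors = {c: math.log(n / len(docs)) for c, n in cat_counts.items()}
    cond = {c: {w: math.log((1 + word_counts[c][w]) / (sum(word_counts[c].values()) + len(vocab)))
                for w in vocab}
            for c in cat_counts}
    return priors, cond, vocab

def classify_nb(priors, cond, vocab, doc):
    # argmax_c [ log P(c) + sum_k log P(w_k | c) ], ignoring unseen words
    return max(priors, key=lambda c: priors[c] + sum(cond[c][w] for w in doc if w in vocab))

docs = [["bonus", "share", "issue"], ["farm", "trade", "bill"], ["share", "issue", "profit"]]
priors, cond, vocab = train_nb(docs, ["finance", "agriculture", "finance"])
print(classify_nb(priors, cond, vocab, ["bonus", "profit"]))   # -> finance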

Bagging
  Train R classifiers f_i, each on a bootstrap sample (of size N) drawn from the training set; the f_i differ because their training samples differ.
  A new document d is classified by letting the R classifiers vote.

Boosting
  Like bagging it combines many classifiers, but they are built sequentially: the k-th classifier concentrates on the examples misclassified by the previous k-1 classifiers.
  AdaBoost is the best-known boosting algorithm.

SVM (Support Vector Machine)

  Training data: (x_1, y_1), ..., (x_l, y_l), with x ∈ R^n and y ∈ {-1, +1}.
  A separating hyperplane: (w · x) + b = 0.

How SVMs work
  [figures: several candidate separating hyperplanes; the maximum-margin hyperplane; the support vectors are the training points that lie on the margin]

For a canonical separating hyperplane the training points satisfy
  (w · x_i) + b ≥ +1  when y_i = +1
  (w · x_i) + b ≤ -1  when y_i = -1
i.e.  y_i [ (w · x_i) + b ] ≥ 1,  i = 1, ..., l.
The margin equals 2 / ||w||, so the maximum-margin hyperplane solves

  minimize  Φ(w) = (1/2) (w · w)                                              (2)
  subject to  y_i [ (x_i · w) + b ] ≥ 1,  i = 1, 2, ..., l


The constrained problem
  minimize  Φ(w) = (1/2) (w · w)
  subject to  y_i [ (x_i · w) + b ] ≥ 1,  i = 1, 2, ..., l
is handled through its Lagrangian (with multipliers α_i ≥ 0):

  L(w, b, α) = (1/2) (w · w) - Σ_{i=1..l} α_i { y_i [ (x_i · w) + b ] - 1 }   (3)

Setting the partial derivatives to zero:
  ∂L(w, b, α)/∂b = 0  ⇒  Σ_{i=1..l} α_i y_i = 0                               (4)
  ∂L(w, b, α)/∂w = 0  ⇒  w = Σ_{i=1..l} y_i α_i x_i                           (5)


By the Kuhn-Tucker (K-T) conditions, substituting (4) and (5) into (3) eliminates w and b:

  L(α) = (1/2) (w · w) - Σ_{i=1..l} α_i { y_i [ (x_i · w) + b ] - 1 }
       = (1/2) (w · w) - Σ_i α_i y_i (x_i · w) - b Σ_i α_i y_i + Σ_i α_i
       = Σ_{i=1..l} α_i - (1/2) Σ_{i=1..l} Σ_{j=1..l} α_i α_j y_i y_j (x_i · x_j)     (6)

  subject to  Σ_{i=1..l} y_i α_i = 0  and  α_i ≥ 0,  i = 1, ..., l                    (7)

The dual problem: maximize
  L(α) = Σ_{i=1..l} α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
  subject to  Σ_i y_i α_i = 0,  α_i ≥ 0,  i = 1, ..., l.

At the optimum the complementarity conditions hold:
  α_i { y_i (w · x_i + b) - 1 } = 0                                                   (8)
so α_i > 0 only for the support vectors (the points with y_i (w · x_i + b) = 1); for all other points α_i = 0.

The resulting decision function is
  f(x) = sgn( Σ_i y_i α_i* (x_i · x) + b* )                                           (9)

The non-separable (soft-margin) case: introduce slack variables ξ_i ≥ 0 and solve

  minimize  Φ(w, ξ) = (1/2) (w · w) + C Σ_{i=1..l} ξ_i                                (10)
  subject to  y_i [ (w · x_i) + b ] ≥ 1 - ξ_i,  i = 1, ..., l                         (11)

The dual keeps the same form as (6)-(8); the only change is that the multipliers are bounded:

  0 ≤ α_i ≤ C                                                                         (12)

Nonlinear SVM
  Classification with a linear SVM (w, b) tests the sign of x_i · w + b.
  In the non-linear case we can see this as K(x_i, w) + b ≥ 0: every dot product is replaced by a kernel K, which computes the dot product (z_i · z) = K(x, x_i) of the mapped points in a feature space.
  The dual objective then becomes
    W(α) = Σ_{i=1..l} α_i - (1/2) Σ_{i=1..l} Σ_{j=1..l} α_i α_j y_i y_j K(x_i, x_j)
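A sketch of the kernelized decision function f(x) = sgn( Σ_i y_i α_i K(x_i, x) + b ) in Python. The support vectors, multipliers and the RBF kernel here are made up for illustration; in practice the α_i and b come from solving the dual QP (e.g., with SMO):

import math

def rbf_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2), one admissible kernel choice
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_decision(x, support, alphas, b, kernel=rbf_kernel):
    # support: list of (x_i, y_i) with y_i in {-1, +1}; alphas: the learned multipliers
    s = sum(a * y * kernel(xi, x) for (xi, y), a in zip(support, alphas)) + b
    return 1 if s >= 0 else -1

support = [([1.0, 1.0], +1), ([-1.0, -1.0], -1)]
print(svm_decision([0.8, 1.2], support, alphas=[0.7, 0.7], b=0.0))   # -> 1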

Kernel functions
  For x, y ∈ X, a kernel k(x, y) = ⟨Φ(x), Φ(y)⟩ corresponds to a mapping Φ : X → F from the input space X into a (Hilbert) feature space F.
  Example: the sigmoid kernel; an admissible kernel must satisfy Mercer's condition.

Training an SVM
  The dual is a quadratic program over the l Lagrange multipliers (with an l × l kernel matrix); decomposition methods such as SMO make it tractable for large training sets.

SVM: properties
  The maximum-margin principle is grounded in statistical learning (VC) theory.

Extensions: Support Vector Regression (SVR)

Ranking SVM
  [figure: instances x11, x12, x13 and x21, x22, x23 from two queries; the SVM is trained on their pairwise preference relations]

Regression based on Least Squares Fit (1991)


Nearest Neighbor Classification (1992) *
Bayesian Probabilistic Models (1992) *
Symbolic Rule Induction (1994)
Decision Tree (1994) *
Neural Networks (1995)
Rocchio approach (traditional IR, 1996) *
Support Vector Machines (1997) *
Boosting or Bagging (1997)*
Hierarchical Language Modeling (1998)
First-Order-Logic Rule Induction (1999)
Maximum Entropy (1999)
Hidden Markov Models (1999)
Error-Correcting Output Coding (1999)

...


Summary
  Feature selection: DF (document frequency), information gain, mutual information, the χ² test (chi-square)
  Classifiers: kNN, SVM