
(2013)
Email: yangjw@pku.edu.cn

Examples of category systems:
  AMS subject classification
  Web catalogs: Yahoo!, Open Directory (http://www.dmoz.org/)

Classification settings:
  binary
  multi-class
  multi-label


MEDLINE (National Library of Medicine)
  $2 million/year for manual indexing of journal articles
  (40%)
  (85%)

Text Categorization (TC)

A classifier is a mapping f : A → B, where A is the set of documents and B is the set of predefined categories.

Automated text categorization research dates from the late 1980s.

Example: the 1990 US Census
  ~22 million returns to be coded into 232 industry categories and 504 occupation categories
  ~$15 million if fully done by hand
  Hand-built expert system AIOCS
    Development time: 192 person-months (2 people, 8 years)
    Accuracy = 47%
  Learned memory-based classifier (Creecy '92: 1-NN)
    Development time: 4 person-months; Accuracy = 60%





Given: a collection of example news stories already labeled with a category (topic).
Task: predict the category for news stories not yet labeled.
For our example, we'll only get to see the headline of the news story.
We'll represent categories using colors. (All examples with the same color belong to the same category.)


Example news-story headlines (the labeled training examples):
  Amatil Proposes Two-for-Five Bonus Share Issue
  Citibank Norway Unit Loses Six Mln Crowns in 1986
  Anheuser-Busch Joins Bid for San Miguel
  Italy's La Fondiaria to Report Higher 1986 Profits
  Japan Ministry Says Open Farm Trade Would Hit U.S.
  Isuzu Plans No Interim Dividend
  Vieille Montagne Says 1986 Conditions Unfavourable
  Senator Defends U.S. Mandatory Farm Control Bill
  Jardine Matheson Said It Sets Two-for-Five Bonus Issue Replacing B Shares
  Bowater Industries Profit Exceed Expectations


A new headline to classify (not yet labeled):
  Senate Panel Studies Loan Rate, Set Aside Plans
(shown together with the labeled training headlines above)


Evaluation measures
  P: precision
  R: recall
  F-measure: the harmonic mean of precision and recall,
    1/F1 = (1/2) (1/P + 1/R),  i.e.  F1 = 2PR / (P + R)

Per-category contingency counts: a = system yes / truth yes, b = system yes / truth no, c = system no / truth yes, d = system no / truth no.
  precision = a/(a+b)
  recall = a/(a+c),  miss rate = 1 - recall
  accuracy = (a+d)/(a+b+c+d),  error = (b+c)/(a+b+c+d) = 1 - accuracy
  fallout = b/(b+d) = false alarm rate
  F_β = (β² + 1) p r / (β² p + r)
Other summary figures:
  Break-Even Point (BEP): the point where p = r
  interpolated 11-point average precision (on the p-r curve)

Category-averaged F measures
  Macro-averaging: compute the measure per category, then average over the m categories
    F1_i = 2 P_i R_i / (P_i + R_i)
    MacroF1 = (1/m) Σ_{i=1..m} F1_i
  Micro-averaging: weight each category by its size n_i
    MicroF1 = Σ_{i=1..m} (n_i F1_i) / Σ_{i=1..m} n_i
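To make the preceding measures concrete, here is a minimal Python sketch (single-label setting; the label lists and category names are made up for illustration). Note that it uses the pooled-counts form of micro-averaging, whereas the slide's MicroF1 formula weights each per-category F1 by the category size n_i.

def prf(tp, fp, fn):
    # precision, recall, F1 from contingency counts (a = tp, b = fp, c = fn)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_micro_f1(y_true, y_pred, categories):
    tables = {c: [0, 0, 0] for c in categories}   # per-category [tp, fp, fn]
    for t, p in zip(y_true, y_pred):
        if t == p:
            tables[t][0] += 1          # true positive for the true category
        else:
            tables[p][1] += 1          # false positive for the predicted category
            tables[t][2] += 1          # false negative for the true category
    macro = sum(prf(*tables[c])[2] for c in categories) / len(categories)
    tp = sum(v[0] for v in tables.values())
    fp = sum(v[1] for v in tables.values())
    fn = sum(v[2] for v in tables.values())
    micro = prf(tp, fp, fn)[2]         # micro: pool the counts, then compute F1
    return macro, micro

y_true = ["earn", "grain", "earn", "acq", "grain"]
y_pred = ["earn", "earn", "earn", "acq", "grain"]
print(macro_micro_f1(y_true, y_pred, ["earn", "grain", "acq"]))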


Document processing (feature extraction):
  strip HTML tags
  remove stop words; stemming
  collect term statistics (TF, DF)
  feature selection
  re-parameterisation (e.g., LSA)

Vector Space Model
  Choose M index terms t_i (characters / words / phrases / concepts).
  Each document d_j is represented as a vector (a_1j, a_2j, ..., a_Mj).
  N documents form the term-document matrix A_{M×N} = (a_ij).
  Similarity between vectors is measured by the cosine of the angle between them.

  Example in a three-term space T1, T2, T3:
    D1 = 2·T1 + 3·T2 + 5·T3
    D2 = 3·T1 + 7·T2 + 1·T3
    Q  = 0·T1 + 0·T2 + 2·T3
  [figure: the three vectors plotted on the T1, T2, T3 axes]
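A worked check of the cosine measure on these toy vectors:
  cos(D1, Q) = (2·0 + 3·0 + 5·2) / ( sqrt(2² + 3² + 5²) · sqrt(0² + 0² + 2²) ) = 10 / (2·sqrt(38)) ≈ 0.81
  cos(D2, Q) = (3·0 + 7·0 + 1·2) / ( sqrt(3² + 7² + 1²) · sqrt(2²) ) = 2 / (2·sqrt(59)) ≈ 0.13
so Q is far closer in direction to D1 than to D2.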

Choice of indexing term
  character
  word
  phrase
  concept
  word cluster / term cluster
  N-gram (N consecutive units within a window)
  David Lewis's experiments suggest that plain words already work well as indexing terms.

Term weighting schemes
  Boolean weighting: a_ij = 1 if TF_ij > 0, otherwise 0
  TF (term frequency): a_ij = TF_ij
    (a word is more important if it appears several times in the target document)
  TF·IDF: a_ij = TF_ij · log(N / DF_i)
  TFC (cosine-normalized TF·IDF):
    a_ij = TF_ij · log(N / DF_i) / sqrt( Σ_k [ TF_kj · log(N / DF_k) ]² )
  LTC (log-TF variant):
    a_ij = log(TF_ij + 1.0) · log(N / DF_i) / sqrt( Σ_k [ log(TF_kj + 1.0) · log(N / DF_k) ]² )
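A minimal sketch of TF·IDF weighting plus cosine similarity in Python (the token lists are made-up examples; a real system would tokenize, remove stop words and stem first):

import math
from collections import Counter

def tfidf_vectors(docs):
    # a_ij = TF_ij * log(N / DF_i), computed for a list of tokenized documents
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))    # document frequency of each term
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    # cosine similarity of two sparse vectors stored as {term: weight} dicts
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["bonus", "share", "issue"], ["farm", "trade", "bill"], ["bonus", "issue", "shares"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[2]))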

Why reduce the number of features?
  The over-fitting problem: with too many features a classifier fits accidental regularities of the training data and generalizes poorly.
  A simple first step is removing stop words.


Feature selection measures:
  DF, document frequency
  information gain
  mutual information
  the χ² test (chi-square, pronounced [kai])

Feature selection — DF (document frequency) thresholding
  Compute the DF of each term over the training corpus and remove terms whose DF is below a threshold.
  Assumption: rare terms carry little category information.
  Simple and scales to large corpora, but it is an ad hoc criterion.

Term-category contingency table (matrix), used by the measures below:
  A = documents that contain term t and belong to category c
  B = documents that contain t but do not belong to c
  C = documents in c that do not contain t
  D = documents neither containing t nor in c
  N = A + B + C + D

Feature selection — entropy
  For a distribution p = {p1, p2, ..., pm}:
    Entropy(p) = - Σ_i P_i log P_i,  with P_i = c_i / c
  (the probabilities are estimated from counts over the n = A + B + C + D documents)

Feature selection — entropy of a term
  For term t, estimate P(c_i | t) for each category c_i from the contingency counts (A / (A + C) in the slide's notation).
  Entropy(t) = - Σ_i P(c_i | t) log P(c_i | t)
  A term whose occurrences concentrate in a few categories (low entropy) is a better discriminator than one spread evenly across categories.

Feature selection — Information Gain (IG)
  IG measures how much the presence or absence of term t reduces the entropy of the category distribution:
    Gain(t) = Entropy(S) - Expected Entropy(S_t)
            = { - Σ_{i=1..M} P(c_i) log P(c_i) }
              - [ P(t) { - Σ_{i=1..M} P(c_i | t) log P(c_i | t) }
                + P(~t) { - Σ_{i=1..M} P(c_i | ~t) log P(c_i | ~t) } ]
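A small Python sketch of IG computed directly from labels (the document labels and the term's document set are hypothetical):

import math

def entropy(probs):
    # H = -sum p log2 p, skipping zero probabilities
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(doc_cats, docs_with_term):
    # Gain(t) = H(C) - [ P(t) H(C|t) + P(~t) H(C|~t) ]
    n = len(doc_cats)
    cats = sorted(set(doc_cats))
    def cat_dist(indices):
        return [sum(1 for i in indices if doc_cats[i] == c) / len(indices) for c in cats] if indices else []
    with_t = [i for i in range(n) if i in docs_with_term]
    without_t = [i for i in range(n) if i not in docs_with_term]
    return entropy(cat_dist(list(range(n)))) - (
        len(with_t) / n * entropy(cat_dist(with_t))
        + len(without_t) / n * entropy(cat_dist(without_t)))

# term occurring in documents 0, 1, 2 of a 6-document collection
print(information_gain(["earn", "earn", "earn", "grain", "grain", "acq"], {0, 1, 2}))  # 1.0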

Feature selection — Mutual Information (MI)
  MI measures the dependence between two random variables X and Y:
    MI(X, Y) = Σ_x Σ_y P(x, y) log [ P(x, y) / ( P(x) P(y) ) ]
  where P(x, y) is the joint probability that X = x and Y = y.

Feature selection — MI for terms
  Let X be a term t (e.g., t present or absent) and Y the category, with m values c = {c1, c2, ..., cm}.
  Estimate P(t), P(c) and P(t, c) from the training corpus, then compute the mutual information between t and the categories.

Feature selection — pointwise MI from the contingency table
  For term t and category c, with the table (rows t / ~t, columns c / ~c) and N = A + B + C + D:
    I(t, c) = log [ P(t ∧ c) / ( P(t) P(c) ) ]
            = log [ (A/N) / ( ((A+B)/N) · ((A+C)/N) ) ]
            = log [ A·N / ( (A+B)(A+C) ) ]
  Combining over categories:
    I_avg(t) = Σ_{i=1..m} P(c_i) I(t, c_i)
    I_max(t) = max_{i=1..m} I(t, c_i)
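The pointwise MI and its category combinations, as a Python sketch (the contingency counts and priors are made up):

import math

def pointwise_mi(a, b, c, d):
    # I(t, c) = log [ A*N / ((A+B)(A+C)) ]
    n = a + b + c + d
    return math.log(a * n / ((a + b) * (a + c))) if a > 0 else float("-inf")

def mi_avg(tables, priors):
    # tables: one (A, B, C, D) tuple per category; priors: P(c_i)
    return sum(p * pointwise_mi(*t) for p, t in zip(priors, tables))

def mi_max(tables):
    return max(pointwise_mi(*t) for t in tables)

tables = [(40, 10, 60, 890), (5, 45, 495, 455)]   # one term vs. two categories
print(mi_avg(tables, [0.1, 0.5]), mi_max(tables))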

Feature selection — a caveat about MI
  MI(t, c) measures the association between t and c, but the score depends strongly on P(t): for terms with the same P(t | c), rarer terms get higher MI values, so scores of terms with very different frequencies are not directly comparable.

Feature selection — the χ² (chi-square) statistic
  χ²(t, c) measures the lack of independence between term t and category c.

Feature selection — computing χ² for a term
  Contingency table (N = A + B + C + D):

           c    ~c
     t     A     B
    ~t     C     D

    χ²(t, c) = N (AD - CB)² / [ (A+C)(B+D)(A+B)(C+D) ]

  If AD < CB, the term and the category are negatively correlated.
  Combining over categories:
    χ²_avg(t) = Σ_{i=1..m} P(c_i) χ²(t, c_i)
    χ²_max(t) = max_{i=1..m} χ²(t, c_i)
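The χ² statistic from the same contingency counts, as a Python sketch (the example counts are invented):

def chi_square(a, b, c, d):
    # chi²(t, c) = N (AD - CB)² / [ (A+C)(B+D)(A+B)(C+D) ]
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

print(chi_square(40, 10, 60, 890))   # term strongly associated with the category
print(chi_square(10, 90, 90, 810))   # term independent of the category -> 0.0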


Summary of feature selection measures
  DF: Document Frequency
  IG: Information Gain
    G(t) = - Σ_{i=1..m} Pr(c_i) log Pr(c_i)
           + Pr(t) Σ_{i=1..m} Pr(c_i | t) log Pr(c_i | t)
           + Pr(~t) Σ_{i=1..m} Pr(c_i | ~t) log Pr(c_i | ~t)
  MI: Mutual Information
    I(t, c) = log [ A·N / ( (A+C)(A+B) ) ]
  CHI: χ² statistic
    χ²(t, c) = N (AD - CB)² / [ (A+C)(B+D)(A+B)(C+D) ]

[figure: feature selection results with kNN — curves for CHI.max, IG, DF, TS, and MI.max]

[figure: feature selection results with LLSF — curves for CHI.max, IG, DF, TS, and MI.max]


[table: properties of classifiers (independence assumptions, binary vs. M-ary output) — DTree, NB, NN, NNet, Rocchio, SVM, LLSF, KNN, WORD]


Classification methods
  Decision Trees
  kNN (K-Nearest Neighbour)
  Bayes Network
  Neural Networks
  Boosting
  SVM


Example decision tree:

  age?
    <=30   → student?
               no  → no
               yes → yes
    30..40 → yes
    >40    → credit rating?
               excellent → no
               fair      → yes

Decision tree algorithms: CART, C4.5 (ID3), CHAID
Trees are simplified by pruning.

Attribute Selection Measure: Information Gain (ID3/C4.5)
  S contains s_i tuples of class C_i for i = {1, ..., m}.
  The information required to classify an arbitrary tuple is
    I(s_1, s_2, ..., s_m) = - Σ_{i=1..m} (s_i / s) log2 (s_i / s)

  Entropy of attribute A with values {a_1, a_2, ..., a_v}:
    E(A) = Σ_{j=1..v} [ (s_1j + ... + s_mj) / s ] · I(s_1j, ..., s_mj)
  Information gained by branching on attribute A:
    Gain(A) = I(s_1, s_2, ..., s_m) - E(A)
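A worked Python sketch of these attribute-selection formulas; the 9/5 class counts and the three-way age split follow the usual textbook example and are illustrative only:

import math

def info(counts):
    # I(s1, ..., sm) = -sum (si/s) log2 (si/s)
    s = sum(counts)
    return -sum(si / s * math.log2(si / s) for si in counts if si)

def gain(class_counts, branches):
    # Gain(A) = I(s1, ..., sm) - E(A); branches holds the class counts for each value of A
    s = sum(class_counts)
    e_a = sum(sum(br) / s * info(br) for br in branches)
    return info(class_counts) - e_a

# 9 "yes" / 5 "no" tuples; splitting on age gives (2,3), (4,0), (3,2)
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ≈ 0.246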

Rocchio method
  Build a prototype vector for each category c from the training vectors x_i = (x_i1, ..., x_iM):
    w_jc = (1 / n_C) Σ_{i ∈ C} x_ij  -  (1 / (n - n_C)) Σ_{i ∉ C} x_ij
  (documents of the category pull the prototype toward them; the other documents push it away)
  Score of document d_i for category c:
    CSV_c(d_i) = w_c · x_i,  often cosine-normalized:  Σ_j w_cj x_ij / ( ||w_c|| · ||x_i|| )
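A minimal Rocchio sketch in Python (documents are sparse {term: weight} dicts; the vectors, labels and category names are invented; the prototype formula above is used with unit weights on the positive and negative parts):

def rocchio_prototypes(vectors, labels, categories):
    # w_jc = (1/n_C) sum_{i in C} x_ij - (1/(n - n_C)) sum_{i not in C} x_ij
    protos = {}
    for c in categories:
        pos = [v for v, l in zip(vectors, labels) if l == c]
        neg = [v for v, l in zip(vectors, labels) if l != c]
        w = {}
        for group, sign in ((pos, 1.0), (neg, -1.0)):
            if not group:
                continue
            for v in group:
                for t, x in v.items():
                    w[t] = w.get(t, 0.0) + sign * x / len(group)
        protos[c] = w
    return protos

def classify(protos, x):
    # CSV_c(d) = w_c . x ; return the best-scoring category
    score = lambda w: sum(wt * x.get(t, 0.0) for t, wt in w.items())
    return max(protos, key=lambda c: score(protos[c]))

docs = [{"bonus": 1.0, "share": 1.0}, {"farm": 1.0, "trade": 1.0}, {"share": 1.0, "issue": 1.0}]
protos = rocchio_prototypes(docs, ["finance", "agriculture", "finance"], ["finance", "agriculture"])
print(classify(protos, {"bonus": 1.0, "issue": 1.0}))   # -> finance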

K-NN

[figure: 1-Nearest Neighbor (graphically)]

kNN — lazy, example-based learning
  [figure: the same test point classified with different neighborhoods, e.g. k = 1 gives class A while k = 5 and k = 10 give class B]

kNN — Instance-Based Learning, Lazy Learning
  Well-known approach to pattern recognition
  Initially proposed by Fix and Hodges (1951)
  Theoretical error-bound analysis by Cover & Hart (1967)
  Applied to text categorization in the early 90s (Yang et al.)
  Among the top-performing methods in TC evaluations
  Scalable to large TC applications

kNN for Text Categorization (Yang, SIGIR-1994)
  Represent documents as points (vectors).
  Define a similarity measure for pairwise documents.
  Tune the parameter k to optimize classification effectiveness.
  Choose a voting scheme (e.g., weighted sum) for scoring categories.
  Threshold the scores for classification decisions.

Nearest Neighbor
  Similar item: we need a functional definition of similarity if we want to apply this automatically.
  Does each neighbor get the same weight?

K-NN using a weighted-sum voting scheme

Category Scoring for Weighted-Sum Voting
  The score for a category is the sum of the similarity scores between the point to be classified and all of its k nearest neighbors that belong to the given category:

    score(c | x) = Σ_{d ∈ kNN of x} sim(x, d) · I(d, c)  -  b_c

  where
    x is the new point and c is a class;
    d is a classified point among the k nearest neighbors of x;
    sim(x, d) is the similarity between x and d;
    I(d, c) = 1 iff point d belongs to class c, and 0 otherwise;
    b_c is a category-specific decision threshold.
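A sketch of the weighted-sum voting in Python (train is a list of (vector, category) pairs; sim can be, e.g., the cosine function from the TF·IDF sketch earlier; all names are illustrative):

def knn_scores(x, train, k, sim):
    # score(c | x) = sum over the k nearest neighbors d of sim(x, d) * I(d, c)
    neighbors = sorted(train, key=lambda dc: sim(x, dc[0]), reverse=True)[:k]
    scores = {}
    for d, c in neighbors:
        scores[c] = scores.get(c, 0.0) + sim(x, d)
    return scores   # compare each score with its threshold b_c, or take the arg max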

kNN: the choice of k
  [figures: classification results for different values of k]

kNN: pros and cons
  Effective: among top-5 methods in benchmark evaluations; applicable to large collections such as Web data.
  kNN is lazy learning: there is no training phase, but all training examples must be stored and each classification requires comparing against them, which is expensive at prediction time.

Neural networks: a single neuron
  The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping:
    y = f( Σ_i w_i x_i - μ_k )
  [diagram: input vector x = (x0, x1, ..., xn), weight vector w = (w0, w1, ..., wn), weighted sum with bias -μ_k, activation function f, output y]
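The single neuron above, as a short Python sketch (the weights, bias and the sigmoid activation are arbitrary choices for illustration):

import math

def neuron(x, w, mu):
    # y = f( sum_i w_i * x_i - mu ), here with a sigmoid activation f
    s = sum(wi * xi for wi, xi in zip(w, x)) - mu
    return 1.0 / (1.0 + math.exp(-s))

print(neuron([1.0, 0.0, 2.0], [0.5, -0.3, 0.8], 1.0))   # ≈ 0.75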

Bayesian (naive Bayes) classifier
  P(c_j | d_i) = P(d_i | c_j) P(c_j) / P(d_i)  ∝  P(d_i | c_j) P(c_j)

  Naive independence assumption over the r terms of document d_i:
    P(d_i | c_j) = Π_{k=1..r} P(w_ik | c_j)

  Category prior, from the number of training documents per category:
    P(c_j) = N(c_j) / Σ_k N(c_k)    (smoothed: (1 + N(c_j)) / (|C| + Σ_k N(c_k)))

  Word probability, from the count N_ij of word w_i in documents of category c_j:
    P(w_i | c_j) = (1 + N_ij) / Σ_k N_kj
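A minimal multinomial naive Bayes sketch in Python. It follows the formulas above except that the word probabilities use full add-one (Laplace) smoothing, i.e. the vocabulary size is added to the denominator, and the prior is left unsmoothed; the toy documents and category names are invented:

import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    # estimate log P(c_j) and log P(w | c_j)
    cat_counts = Counter(labels)
    word_counts = defaultdict(Counter)            # N_ij: count of word i in category j
    vocab = set()
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
        vocab.update(d)
    priors = {c: math.log(n / len(docs)) for c, n in cat_counts.items()}
    cond = {c: {w: math.log((1 + word_counts[c][w]) / (sum(word_counts[c].values()) + len(vocab)))
                for w in vocab}
            for c in cat_counts}
    return priors, cond, vocab

def classify_nb(priors, cond, vocab, doc):
    # argmax_c [ log P(c) + sum_k log P(w_k | c) ], ignoring unseen words
    return max(priors, key=lambda c: priors[c] + sum(cond[c][w] for w in doc if w in vocab))

docs = [["bonus", "share", "issue"], ["farm", "trade", "bill"], ["share", "issue", "profit"]]
priors, cond, vocab = train_nb(docs, ["finance", "agriculture", "finance"])
print(classify_nb(priors, cond, vocab, ["bonus", "profit"]))   # -> finance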

Bagging
  Train R classifiers f_i, each on a bootstrap sample (of size N) drawn from the training set; the f_i differ because their training samples differ.
  A new document d is classified by letting the R classifiers vote.

Boosting
  Like bagging it combines many classifiers, but they are built sequentially: the k-th classifier concentrates on the examples misclassified by the previous k-1 classifiers.
  AdaBoost is the best-known boosting algorithm.

SVM (Support Vector Machine)

  Training data: (x_1, y_1), ..., (x_l, y_l), with x ∈ R^n and y ∈ {-1, +1}.
  A separating hyperplane: (w · x) + b = 0.

How SVMs work
  [figures: several candidate separating hyperplanes; the maximum-margin hyperplane; the support vectors are the training points that lie on the margin]

For a canonical separating hyperplane the training points satisfy
  (w · x_i) + b ≥ +1  when y_i = +1
  (w · x_i) + b ≤ -1  when y_i = -1
i.e.  y_i [ (w · x_i) + b ] ≥ 1,  i = 1, ..., l.
The margin equals 2 / ||w||, so the maximum-margin hyperplane solves

  minimize  Φ(w) = (1/2) (w · w)                                              (2)
  subject to  y_i [ (x_i · w) + b ] ≥ 1,  i = 1, 2, ..., l


The constrained problem
  minimize  Φ(w) = (1/2) (w · w)
  subject to  y_i [ (x_i · w) + b ] ≥ 1,  i = 1, 2, ..., l
is handled through its Lagrangian (with multipliers α_i ≥ 0):

  L(w, b, α) = (1/2) (w · w) - Σ_{i=1..l} α_i { y_i [ (x_i · w) + b ] - 1 }   (3)

Setting the partial derivatives to zero:
  ∂L(w, b, α)/∂b = 0  ⇒  Σ_{i=1..l} α_i y_i = 0                               (4)
  ∂L(w, b, α)/∂w = 0  ⇒  w = Σ_{i=1..l} y_i α_i x_i                           (5)


By the Kuhn-Tucker (K-T) conditions, substituting (4) and (5) into (3) eliminates w and b:

  L(α) = (1/2) (w · w) - Σ_{i=1..l} α_i { y_i [ (x_i · w) + b ] - 1 }
       = (1/2) (w · w) - Σ_i α_i y_i (x_i · w) - b Σ_i α_i y_i + Σ_i α_i
       = Σ_{i=1..l} α_i - (1/2) Σ_{i=1..l} Σ_{j=1..l} α_i α_j y_i y_j (x_i · x_j)     (6)

  subject to  Σ_{i=1..l} y_i α_i = 0  and  α_i ≥ 0,  i = 1, ..., l                    (7)

The dual problem: maximize
  L(α) = Σ_{i=1..l} α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
  subject to  Σ_i y_i α_i = 0,  α_i ≥ 0,  i = 1, ..., l.

At the optimum the complementarity conditions hold:
  α_i { y_i (w · x_i + b) - 1 } = 0                                                   (8)
so α_i > 0 only for the support vectors (the points with y_i (w · x_i + b) = 1); for all other points α_i = 0.

The resulting decision function is
  f(x) = sgn( Σ_i y_i α_i* (x_i · x) + b* )                                           (9)

The non-separable (soft-margin) case: introduce slack variables ξ_i ≥ 0 and solve

  minimize  Φ(w, ξ) = (1/2) (w · w) + C Σ_{i=1..l} ξ_i                                (10)
  subject to  y_i [ (w · x_i) + b ] ≥ 1 - ξ_i,  i = 1, ..., l                         (11)

The dual keeps the same form as (6)-(8); the only change is that the multipliers are bounded:

  0 ≤ α_i ≤ C                                                                         (12)

Nonlinear SVM
  Classification with a linear SVM (w, b) tests the sign of x_i · w + b.
  In the non-linear case we can see this as K(x_i, w) + b ≥ 0: every dot product is replaced by a kernel K, which computes the dot product (z_i · z) = K(x, x_i) of the mapped points in a feature space.
  The dual objective then becomes
    W(α) = Σ_{i=1..l} α_i - (1/2) Σ_{i=1..l} Σ_{j=1..l} α_i α_j y_i y_j K(x_i, x_j)
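A sketch of the kernelized decision function f(x) = sgn( Σ_i y_i α_i K(x_i, x) + b ) in Python. The support vectors, multipliers and the RBF kernel here are made up for illustration; in practice the α_i and b come from solving the dual QP (e.g., with SMO):

import math

def rbf_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2), one admissible kernel choice
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_decision(x, support, alphas, b, kernel=rbf_kernel):
    # support: list of (x_i, y_i) with y_i in {-1, +1}; alphas: the learned multipliers
    s = sum(a * y * kernel(xi, x) for (xi, y), a in zip(support, alphas)) + b
    return 1 if s >= 0 else -1

support = [([1.0, 1.0], +1), ([-1.0, -1.0], -1)]
print(svm_decision([0.8, 1.2], support, alphas=[0.7, 0.7], b=0.0))   # -> 1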

Kernel functions
  For x, y ∈ X, a kernel k(x, y) = ⟨Φ(x), Φ(y)⟩ corresponds to a mapping Φ : X → F from the input space X into a (Hilbert) feature space F.
  Example: the sigmoid kernel; an admissible kernel must satisfy Mercer's condition.

Training an SVM
  The dual is a quadratic program over the l Lagrange multipliers (with an l × l kernel matrix); decomposition methods such as SMO make it tractable for large training sets.

SVM: properties
  The maximum-margin principle is grounded in statistical learning (VC) theory.

Extensions: Support Vector Regression (SVR)

Ranking SVM
  [figure: instances x11, x12, x13 and x21, x22, x23 from two queries; the SVM is trained on their pairwise preference relations]

Regression based on Least Squares Fit (1991)


Nearest Neighbor Classification (1992) *
Bayesian Probabilistic Models (1992) *
Symbolic Rule Induction (1994)
Decision Tree (1994) *
Neural Networks (1995)
Rocchio approach (traditional IR, 1996) *
Support Vector Machines (1997) *
Boosting or Bagging (1997)*
Hierarchical Language Modeling (1998)
First-Order-Logic Rule Induction (1999)
Maximum Entropy (1999)
Hidden Markov Models (1999)
Error-Correcting Output Coding (1999)

...


Summary
  Feature selection: DF (document frequency), information gain, mutual information, the χ² test (chi-square)
  Classifiers: kNN, SVM