Email:yangjw@pku.edu.cn
AMS
Web catalogs: Yahoo, Open Directory (http://www.dmoz.org/)
Binary classification
Multi-class classification
Multi-label classification
(80)
and
()
(85%)
$f : A \to B$
8
(late 1980s)
10
()
11
1990
2200
232504
$15
Expert System AIOCS
Development time: 192 person-months (2 people, 8 years)
Accuracy = 47%
Given: a collection of example news stories already labeled with a category (topic).
Task: predict the category of news stories not yet labeled.
For our example, we'll only get to see the headline of each news story.
We'll represent categories using colors. (All examples with the same color belong to the same category.)
20
Amatil Proposes Two-for-Five Bonus Share Issue
Citibank Norway Unit Loses Six Mln Crowns in 1986
Anheuser-Busch Joins Bid for San Miguel
Italy's La Fondiaria to Report Higher 1986 Profits
Japan Ministry Says Open Farm Trade Would Hit U.S.
Isuzu Plans No Interim Dividend
Vieille Montagne Says 1986 Conditions Unfavourable
Senator Defends U.S. Mandatory Farm Control Bill
Jardine Matheson Said It Sets Two-for-Five Bonus Issue Replacing B Shares
Bowater Industries Profit Exceed Expectations
New story to classify: Senate Panel Studies Loan Rate, Set Aside Plans
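P: precision
R: recall
F-measure: the weighted harmonic mean of P and R
$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$$
With equal weights ($\alpha = 1/2$):
$$F_1 = \frac{2PR}{P + R}$$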
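For one category, let a = documents correctly assigned to it, b = documents wrongly assigned to it, c = documents wrongly rejected, d = documents correctly rejected.
Precision = a/(a+b)
Recall = a/(a+c); miss rate = 1 - recall
Accuracy = (a+d)/(a+b+c+d); error = (b+c)/(a+b+c+d) = 1 - accuracy
Fallout = b/(b+d) = false alarm rate
$F_\beta = \dfrac{(\beta^2 + 1)\,p\,r}{\beta^2 p + r}$
Break-even point (BEP): the value at which p = r
Interpolated 11-point average precision (from the p-r curve)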
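The measures above can be computed directly from the four counts. The following is a minimal sketch, assuming a = true positives, b = false positives, c = false negatives and d = true negatives (the reading the formulas imply). The function name `binary_metrics` and the example counts are illustrative and not taken from the slides.

```python
def binary_metrics(a, b, c, d, beta=1.0):
    """Evaluation measures for one category.

    Assumed meaning of the counts: a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    n = a + b + c + d
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "miss_rate": 1 - recall,
        "accuracy": (a + d) / n,
        "error": (b + c) / n,
        "fallout": b / (b + d) if (b + d) else 0.0,
        "f_beta": f_beta,
    }


# Hypothetical counts, chosen only for illustration.
m = binary_metrics(a=40, b=10, c=20, d=30)
print(m)
```

With `beta=1.0` the last entry is the familiar F1; a larger `beta` gives recall more weight.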
27
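Macro-averaging: compute $F_i$ for each of the $m$ categories, then average
$$F_{1i} = \frac{2 P_i R_i}{P_i + R_i}$$
$$\text{Macro-}F = \frac{1}{m} \sum_{i=1}^{m} F_i$$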
Micro-averaging: weight each $F_i$ by $n_i$
$$\text{Micro-}F = \frac{\sum_{i=1}^{m} n_i F_i}{\sum_{i=1}^{m} n_i}$$
Feature extraction
Remove HTML tags
Remove stop words; stemming
Term statistics (TF, DF)
Feature selection
Re-parameterisation (e.g. LSA)
31
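Document $j$ is represented as the vector $(a_{1j}, a_{2j}, \ldots, a_{Mj})$.
A collection of $N$ documents is the matrix $A_{M \times N} = (a_{ij})$.
Similarity between vectors: cosine.
[Figure: vector space with term axes T1, T2, T3; example document D2 = 3T1 + 7T2 + T3]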
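As a small illustration of the cosine measure in this vector space, the sketch below compares the example document D2 = 3T1 + 7T2 + T3 with a second vector. The second vector `q` is hypothetical, introduced only for the example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention assumed here for an all-zero vector
    return dot / (norm_u * norm_v)


d2 = [3, 7, 1]  # D2 = 3*T1 + 7*T2 + 1*T3
q = [0, 1, 0]   # hypothetical vector that only contains term T2
sim = cosine(d2, q)
print(sim)
```

Because the cosine divides by both vector lengths, a long and a short document with the same term proportions get the same score.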
32
Term
Character
Word
Phrase
Concept
Word cluster
N-gram (window of N)
David Lewis()
Words
33
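Boolean weighting: $a_{ij} = 1$ if $TF_{ij} > 0$, or $0$ if $TF_{ij} = 0$
TF (term frequency): $a_{ij} = TF_{ij}$
TF*IDF: $a_{ij} = TF_{ij} \cdot \log(N / DF_i)$
TFC (length-normalized TF*IDF):
$$a_{ij} = \frac{TF_{ij} \cdot \log(N / DF_i)}{\sqrt{\sum_k [TF_{kj} \cdot \log(N / DF_k)]^2}}$$
LTC (TFC with a logarithmic TF):
$$a_{ij} = \frac{\log(TF_{ij} + 1.0) \cdot \log(N / DF_i)}{\sqrt{\sum_k [\log(TF_{kj} + 1.0) \cdot \log(N / DF_k)]^2}}$$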
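The TF*IDF and TFC weightings can be sketched in a few lines. This is an illustrative implementation under simple assumptions: documents are already tokenized, the natural logarithm is used, and the toy corpus and the function names `tf_idf` and `tfc` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: TF * log(N / DF)} dict per document."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted


def tfc(weights):
    """Length-normalize one document's TF*IDF weights (TFC weighting)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {t: (v / norm if norm else 0.0) for t, v in weights.items()}


# Hypothetical toy corpus of three tokenized headlines.
docs = [["profit", "profit", "bonus"], ["farm", "trade"], ["profit", "farm"]]
weights = tf_idf(docs)
unit = tfc(weights[0])
print(weights[0], unit)
```

A term that occurs in every document gets weight 0 under this scheme, since log(N / DF) = log 1 = 0.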
Over-fitting problem
N10
stop words
36
DF, document frequency
information gain
mutual information
The χ² test (chi-square, pronounced [kai])
37
--DF
DF
TermDF
ad hoc
38
Contingency table (matrix)
39
--
Entropy
$$\mathrm{Entropy}(t) = -\sum_i P_i \log P_i$$
$P_i = c_i / c$
$P_i = (A + B) / n$
$n = A + B + C + D$
40
--
$P(c_i \mid t) = A / (A + C)$: the probability of class $c_i$ given term $t$
41
--IG
42
--MI
Mutual information (MI) of X and Y:
$$MI(X, Y) = \sum_x \sum_y P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)}$$
where $P(x, y)$ is the joint probability of $x$ and $y$.
43
--MI
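X and Y are the term $t$ and the class $c$: $t$ takes values 0, 1, 2, ...; $c$ ranges over the $m$ classes $\{c_1, c_2, \ldots, c_m\}$.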
44
--MI
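Mutual Information (MI) between term t and category c
Contingency table (N = A + B + C + D):

          c     ~c
   t      A     B
  ~t      C     D

$$I(t, c) = \log \frac{P(t \wedge c)}{P(t)\,P(c)} = \log \frac{A / N}{((A + B) / N) \cdot ((A + C) / N)} = \log \frac{A \cdot N}{(A + B)(A + C)}$$
$$I_{AVG}(t) = \sum_{i=1}^{m} P(c_i)\, I(t, c_i)$$
$$I_{MAX}(t) = \max_{i=1}^{m} P(c_i)\, I(t, c_i)$$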
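A short sketch of the term-category MI score, computed from contingency counts. It assumes A = documents in c that contain t, B = documents outside c that contain t, C = documents in c without t, D = documents outside c without t, with P(c) estimated as (A + C) / N. The counts are hypothetical, and only I(t, c) and the average are shown.

```python
import math


def pointwise_mi(a, b, c, n):
    """I(t, c) = log( A * N / ((A + B) * (A + C)) )."""
    if a == 0:
        return float("-inf")  # t never occurs in c: the log is undefined
    return math.log(a * n / ((a + b) * (a + c)))


def average_mi(tables):
    """I_AVG(t): sum over categories of P(c_i) * I(t, c_i).

    `tables` holds one (A, B, C, D) tuple per category, all built from the
    same N documents.
    """
    n = sum(tables[0])
    return sum(((a + c) / n) * pointwise_mi(a, b, c, n)
               for a, b, c, d in tables)


# Hypothetical counts for a two-category collection of N = 100 documents.
tables = [(30, 10, 20, 40), (10, 30, 40, 20)]
mi_c1 = pointwise_mi(30, 10, 20, 100)
mi_avg = average_mi(tables)
print(mi_c1, mi_avg)
```

The score is positive when the term occurs in the category more often than independence would predict, and negative when it occurs less often.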
45
--MI
MI(t,C)tC
P(t|c) t
MI
46
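The χ² statistic (chi-square), χ²(t, c), for a term t and a category c
Contingency table (N = A + B + C + D):

          c     ~c
   t      A     B
  ~t      C     D

$$\chi^2(t, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$
When AD < BC, the term and the category are negatively correlated.
$$\chi^2_{AVG}(t) = \sum_{i=1}^{m} P(c_i)\, \chi^2(t, c_i)$$
$$\chi^2_{MAX}(t) = \max_{i=1}^{m} \{ \chi^2(t, c_i) \}$$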
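The χ² score and its two aggregates follow directly from the table. This is a minimal sketch with hypothetical counts; P(c_i) is estimated as (A + C) / N, which is an assumption consistent with the table layout.

```python
def chi_square(a, b, c, d):
    """chi^2(t, c) = N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or category is constant
    return n * (a * d - c * b) ** 2 / denom


def chi_avg(tables):
    """Category-probability-weighted average over (A, B, C, D) tables."""
    n = sum(tables[0])
    return sum(((a + c) / n) * chi_square(a, b, c, d) for a, b, c, d in tables)


def chi_max(tables):
    """Largest per-category score."""
    return max(chi_square(*t) for t in tables)


# Hypothetical counts for a two-category collection of N = 100 documents.
tables = [(30, 10, 20, 40), (10, 30, 40, 20)]
score = chi_square(30, 10, 20, 40)
print(score, chi_avg(tables), chi_max(tables))
```

Because (AD - CB) is squared, the score itself does not tell positive from negative association; the sign of AD - CB has to be checked separately.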
48
DF: Document Frequency
IG: Information Gain
$$G(t) = -\sum_{i=1}^{m} \Pr(c_i) \log \Pr(c_i) + \Pr(t) \sum_{i=1}^{m} \Pr(c_i \mid t) \log \Pr(c_i \mid t) + \Pr(\bar{t}) \sum_{i=1}^{m} \Pr(c_i \mid \bar{t}) \log \Pr(c_i \mid \bar{t})$$
MI: Mutual Information
$$I(t, c) = \log \frac{A \cdot N}{(A + C)(A + B)}$$
CHI: χ² statistic
$$\chi^2(t, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$
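To make the information gain formula concrete, the sketch below evaluates G(t) from per-category document counts. The input layout (one list of counts for documents containing t, one for documents without t) and the numbers are assumptions for illustration; the natural logarithm is used.

```python
import math


def plogp_sum(probs):
    """Sum of p * log(p), skipping zero probabilities."""
    return sum(p * math.log(p) for p in probs if p > 0)


def information_gain(with_t, without_t):
    """G(t) from per-category document counts.

    with_t[i]    = documents of category i that contain term t
    without_t[i] = documents of category i that do not contain t
    """
    n_t, n_not = sum(with_t), sum(without_t)
    n = n_t + n_not
    p_c = [(a + c) / n for a, c in zip(with_t, without_t)]
    gain = -plogp_sum(p_c)
    if n_t:
        gain += (n_t / n) * plogp_sum(a / n_t for a in with_t)
    if n_not:
        gain += (n_not / n) * plogp_sum(c / n_not for c in without_t)
    return gain


# Hypothetical counts for two categories.
g = information_gain(with_t=[30, 10], without_t=[20, 40])
g_independent = information_gain(with_t=[20, 20], without_t=[30, 30])
print(g, g_independent)
```

A term whose presence does not change the category distribution, as in the second call, has a gain of zero up to rounding.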
[Figure: feature selection with kNN; curves for CHI.max, IG, DF, TS and MI.max]
[Figure: feature selection with LLSF; curves for CHI.max, IG, DF, TS and MI.max]
Independency Binary
DTree
NB
NN
NNet
M-ary
Rocchio
SVM
LLSF
KNN
WORD
54
Decision Trees
KNN(K-Nearest Neighbour)
Bayes Network
Neural Networks
Boosting
SVM
55
[Figure: decision tree]
age?
  <=30: student?
    no: no
    yes: yes
  30..40 (overcast): yes
  >40: credit rating?
    excellent: no
    fair: yes
56
CART
C4.5
(ID3)
CHAID
(pruning)
57
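Information needed to classify an arbitrary tuple, where $s_i$ of the $s$ samples belong to class $i$ ($i = 1, \ldots, m$):
$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$$
Information gained by branching on attribute A, where $E(A)$ is the entropy that remains after partitioning on A:
$$Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)$$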
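A small worked sketch of Gain(A). It assumes E(A) is the size-weighted average of the information in each branch, which is the usual ID3 definition but is not spelled out in the text above. The class counts are hypothetical: 9 positive and 5 negative samples, split into three branches.

```python
import math


def info(counts):
    """I(s_1, ..., s_m) = -sum (s_i / s) * log2(s_i / s); counts must not be empty."""
    s = sum(counts)
    return -sum((si / s) * math.log2(si / s) for si in counts if si > 0)


def gain(class_counts, partitions):
    """Gain(A) = I(s_1, ..., s_m) - E(A).

    Assumption: E(A) weights each branch's information by the share of
    samples that fall into that branch.
    """
    s = sum(class_counts)
    expected = sum((sum(part) / s) * info(part) for part in partitions)
    return info(class_counts) - expected


# Hypothetical data: [yes, no] counts overall and in each of three branches.
g = gain([9, 5], [[2, 3], [4, 0], [3, 2]])
print(round(g, 3))
```

The attribute with the largest gain is the one chosen for the split at that node.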
59
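Rocchio
$$w'_{jc} = \alpha\, w_{jc} + \beta\, \frac{\sum_{i \in C} x_{ij}}{n_C} - \gamma\, \frac{\sum_{i \notin C} x_{ij}}{n - n_C}$$
$$CSV_c(d_i) = \vec{w}_c \cdot \vec{x}_i = \frac{\sum_j w_{cj}\, x_{ij}}{\sqrt{\sum_j w_{cj}^2 \, \sum_j x_{ij}^2}}$$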
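The Rocchio update and the scoring step can be sketched as follows. The coefficient names `alpha`, `beta` and `gamma`, their default values, and the toy vectors are assumptions for illustration; the score is the cosine between the class prototype and the document.

```python
import math


def rocchio_update(w, docs, in_class, alpha=1.0, beta=1.0, gamma=1.0):
    """w'_j = alpha*w_j + beta*mean_j(positives) - gamma*mean_j(negatives)."""
    n, n_c = len(docs), sum(in_class)
    if n_c == 0 or n_c == n:
        raise ValueError("need at least one positive and one negative example")
    updated = []
    for j in range(len(w)):
        pos = sum(x[j] for x, y in zip(docs, in_class) if y)
        neg = sum(x[j] for x, y in zip(docs, in_class) if not y)
        updated.append(alpha * w[j] + beta * pos / n_c - gamma * neg / (n - n_c))
    return updated


def csv_score(w, x):
    """Cosine-normalized dot product between prototype w and document x."""
    dot = sum(wj * xj for wj, xj in zip(w, x))
    norm = math.sqrt(sum(wj * wj for wj in w) * sum(xj * xj for xj in x))
    return dot / norm if norm else 0.0


# Hypothetical two-term documents; the first two belong to the class.
docs = [[1, 0], [1, 1], [0, 1]]
in_class = [True, True, False]
w = rocchio_update([0.0, 0.0], docs, in_class)
print(w, csv_score(w, [1, 0]), csv_score(w, [0, 1]))
```

The prototype moves toward the mean of the class members and away from the mean of the other documents, so in-class documents score higher.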
60
K-NN
61
62
kNN
k=1: A
k=5: B
k=10: B
63
kNN
64
65
Nearest Neighbor
Similar item: We need a functional
definition of similarity if we
want to apply this automatically.
66
67
68
$$score(x, c) = \sum_{d \in kNN(x)} sim(x, d)\, I(d, c)$$
where $I(d, c) = 1$ iff document $d$ belongs to category $c$; $I(d, c) = 0$ otherwise.
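The similarity-weighted vote can be sketched as below. Cosine is used as the similarity function, which is one reasonable choice rather than something the text fixes; the training vectors, labels and the name `knn_scores` are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_scores(x, training, k, sim=cosine):
    """Sum sim(x, d) over the k most similar training documents, per category."""
    neighbours = sorted(training, key=lambda pair: sim(x, pair[0]), reverse=True)[:k]
    scores = {}
    for d, c in neighbours:
        scores[c] = scores.get(c, 0.0) + sim(x, d)
    return scores


# Hypothetical labeled documents as (vector, category) pairs.
training = [
    ([1.0, 0.0], "A"),
    ([0.9, 0.1], "A"),
    ([0.0, 1.0], "B"),
    ([0.1, 0.9], "B"),
    ([0.2, 0.8], "B"),
]
scores = knn_scores([1.0, 0.2], training, k=3)
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

All the work happens at query time: nothing is fitted in advance, and every training document is compared with the query.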
69
kNNk
70
kNNk
71
kNN: lazy learning
72
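[Figure: a neuron. The input vector x = (x0, x1, ..., xn) and the weight vector w = (w0, w1, ..., wn) are combined into a weighted sum, offset by the bias μk, and passed through the activation function f to give the output y.]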
73
Bayesian
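$$P(c_j \mid d_i) = \frac{P(d_i \mid c_j)\, P(c_j)}{P(d_i)} \propto P(d_i \mid c_j)\, P(c_j)$$
$$P(d_i \mid c_j) = \prod_{k=1}^{r} P(w_{ik} \mid c_j)$$
$$P(c_j) \approx \frac{N(c_j)}{\sum_k N(c_k)} \approx \frac{1 + N(c_j)}{|c| + \sum_k N(c_k)}$$
$$P(w_i \mid c_j) \approx \frac{1 + N_{ij}}{M + \sum_k N_{kj}}$$
where $N(c_j)$ is the number of training documents in class $c_j$, $N_{ij}$ is the number of occurrences of word $w_i$ in class $c_j$, and $M$ is the number of distinct terms.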
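A compact sketch of a Naive Bayes text classifier with add-one smoothing, working in log space to avoid underflow. Taking the smoothing constant in the word estimate to be the vocabulary size is an assumption, and the tiny training set and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (tokens, label). Returns the counts the classifier needs."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def classify_nb(tokens, model):
    """Return the class maximizing log P(c) + sum_k log P(w_k | c)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best, best_lp = None, float("-inf")
    for c, n_c in class_docs.items():
        lp = math.log((1 + n_c) / (len(class_docs) + total_docs))
        total_words = sum(word_counts[c].values())
        for w in tokens:
            if w not in vocab:
                continue  # assumption: words unseen in training are ignored
            lp += math.log((1 + word_counts[c][w]) / (len(vocab) + total_words))
        if lp > best_lp:
            best, best_lp = c, lp
    return best


# Hypothetical training headlines, already tokenized and lowercased.
model = train_nb([
    (["bonus", "share", "issue"], "finance"),
    (["profit", "dividend"], "finance"),
    (["farm", "trade"], "agri"),
    (["farm", "bill"], "agri"),
])
label = classify_nb(["farm", "control", "bill"], model)
print(label)
```

The add-one terms keep a word that never occurred in a class from driving the whole product to zero.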
74
Bagging
Rfi
fi()N
dR
75
Boosting
Bagging
kk-1
AdaBoost
76
SVM
77
SVM
Training data: $(x_1, y_1), \ldots, (x_l, y_l)$, $x \in R^n$, $y \in \{+1, -1\}$
Separating hyperplane: $(w \cdot x) + b = 0$
[Figures: separating hyperplanes and the margin]
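$(w \cdot x_i) + b \ge 1$ when $y_i = +1$
$(w \cdot x_i) + b \le -1$ when $y_i = -1$
Combined: $y_i[(w \cdot x_i) + b] \ge 1$, $i = 1, \ldots, l$
Margin: $2 / \|w\|$
Maximizing the margin subject to
$$y_i[(x_i \cdot w) + b] \ge 1, \quad i = 1, 2, \ldots, l \qquad (1)$$
is equivalent to minimizing
$$\Phi(w) = \frac{1}{2}(w \cdot w) \qquad (2)$$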
88
89
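Minimize $\Phi(w) = \frac{1}{2}(w \cdot w)$ subject to $y_i[(x_i \cdot w) + b] \ge 1$, $i = 1, 2, \ldots, l$.
Lagrangian, with multipliers $\alpha_i \ge 0$:
$$L(w, b, \alpha) = \frac{1}{2}(w \cdot w) - \sum_{i=1}^{l} \alpha_i \{ y_i[(x_i \cdot w) + b] - 1 \} \qquad (3)$$
Setting the partial derivatives to zero:
$$\frac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0 \qquad (4)$$
$$\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{l} y_i \alpha_i x_i \qquad (5)$$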
90
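Kuhn-Tucker (K-T) conditions
Substituting (4) and (5) into (3):
$$L(\alpha) = \frac{1}{2}(w \cdot w) - \sum_{i=1}^{l} \alpha_i \{ y_i[(x_i \cdot w) + b] - 1 \}$$
$$= \frac{1}{2}(w \cdot w) - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w) - b \sum_{i=1}^{l} \alpha_i y_i + \sum_{i=1}^{l} \alpha_i$$
$$= \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \qquad (6)$$
subject to
$$\sum_{i=1}^{l} y_i \alpha_i = 0, \quad \alpha_i \ge 0, \quad i = 1, \ldots, l \qquad (7)$$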
91
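Maximize
$$L(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
subject to $\sum_i y_i \alpha_i = 0$; $\alpha_i \ge 0$, $i = 1, \ldots, l$.
K-T condition at the optimum:
$$\alpha_i \{ y_i (w \cdot x_i + b) - 1 \} = 0 \qquad (8)$$
so $\alpha_i > 0$ is possible only for points with $y_i (w \cdot x_i + b) = 1$ (the support vectors); $\alpha_i = 0$ for all other points.
Decision function:
$$f(x) = \mathrm{sgn}\Big\{ \sum_{i=1}^{l} y_i \alpha_i^* (x_i \cdot x) + b^* \Big\} \qquad (9)$$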
92
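Introduce slack variables $\xi_i \ge 0$ and minimize
$$\Phi(w, \xi) = \frac{1}{2}(w \cdot w) + C \sum_{i=1}^{l} \xi_i \qquad (10)$$
subject to
$$y_i[(w \cdot x_i) + b] \ge 1 - \xi_i, \quad i = 1, \ldots, l \qquad (11)$$
The dual problem has the same form as (6) and (7) (compare (8)), with the constraint on $\alpha_i$ replaced by
$$0 \le \alpha_i \le C \qquad (12)$$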
93
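SVM: the nonlinear case
Linear case: $(x_i \cdot w) + b = 0$
In the nonlinear case we can see this as $K(x_i, w) + b = 0$
The inner product in the feature space is computed by a kernel: $(Z_i \cdot Z) = K(x, x_i)$
The dual objective becomes
$$W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$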
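To connect the dual objective with the decision function, the sketch below evaluates both for a kernel passed in as a parameter. It does not solve the optimization: the two training points, the multipliers and b are a hand-worked toy case (with the linear kernel, the dual reduces to 2α - 4α², maximized at α = 1/4), and all names are illustrative.

```python
def linear_kernel(u, v):
    """K(u, v) = u . v; any other kernel function can be passed instead."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, xs, ys, kernel=linear_kernel):
    """W(alpha) = sum_i alpha_i - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    l = len(xs)
    quad = sum(alpha[i] * alpha[j] * ys[i] * ys[j] * kernel(xs[i], xs[j])
               for i in range(l) for j in range(l))
    return sum(alpha) - 0.5 * quad


def decide(x, alpha, xs, ys, b, kernel=linear_kernel):
    """f(x) = sgn( sum_i y_i alpha_i K(x_i, x) + b )."""
    s = sum(a * y * kernel(xi, x) for a, y, xi in zip(alpha, ys, xs)) + b
    return 1 if s >= 0 else -1


# Toy problem: one point per class, symmetric about the origin.
xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alpha = [0.25, 0.25]  # hand-derived optimum for this toy case
b = 0.0

w_opt = dual_objective(alpha, xs, ys)
print(w_opt, decide((2.0, 0.0), alpha, xs, ys, b), decide((-3.0, 1.0), alpha, xs, ys, b))
```

Only kernel evaluations between training points and the query appear, which is why replacing `linear_kernel` with another kernel gives a nonlinear classifier without changing the rest of the code.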
94
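For $x, y \in X$: $k(x, y) = \phi(x) \cdot \phi(y)$, where $\phi : X \to F$ maps the input space $X \subseteq R^n$ into a feature space $F$ (a Hilbert space).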
95
Sigmoid
Mercers
96
L*L
Lagrange
SMO
97
SVM
VC
98
(SVR)
99
(Ranking SVM)
x21
x22
x23
x11
x12
x13
100
...
101
DF,
document frequency
information gain
mutual information
The χ² test, chi-square
KNN
SVM