I.
ci
We can identify three distinct stages in the design of a text classification system: document representation, classifier construction, and classifier evaluation.
Step 1: Data preprocessing
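The purpose of this step is to clean the input data reasonably well so that the later steps can process it more effectively. The work in this step is therefore limited to converting the input documents into plain text strings, with the following requirements: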
- Input: the set of documents to be analyzed (PDF, TXT, DOC, HTML, HTM files).
- Output: a plain-text string (text only) in a predefined font and format.
Implementation:
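A minimal sketch of this conversion for TXT and HTML input, using only Python's standard library. The helper name `to_plain_text` and its `is_html` flag are hypothetical and chosen for illustration; PDF and DOC files would need a third-party extraction library and are not covered here.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def to_plain_text(raw, is_html=False):
    """Return `raw` as plain text with whitespace collapsed (hypothetical helper)."""
    if is_html:
        parser = _TextExtractor()
        parser.feed(raw)
        parser.close()
        raw = " ".join(parser.parts)
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    return re.sub(r"\s+", " ", raw).strip()
```

Calling `to_plain_text(page, is_html=True)` on an HTML string returns only its visible text, which can then be passed to the word-splitting step.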
Step 2: Word segmentation
The purpose of this step is to split a plain-text document into words while ensuring that each word keeps its meaning within the document that contains it. After sentence segmentation, the following tasks are considered:
+ Filtering (Filtration)
Filtering is the process of deciding which words should be used to represent the documents, so that those words can serve to:
- Describe the content of the document.
- Distinguish the document from the other documents in the collection.
At this stage we remove stopwords (a list of words that do not affect the content of the document). For English, these words can be filtered using the list provided at http://armandbrahaj.blog.al/2009/04/14/list-of-english-stopwords/
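Stopword removal can be sketched as follows. The tiny `STOPWORDS` set and the helper names `tokenize` and `remove_stopwords` are assumptions made for illustration; a real system would load a full stopword list instead.

```python
import re

# Tiny illustrative stopword list (an assumption); a real system
# would load a complete list from a file or a library.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "in", "to", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]


tokens = remove_stopwords(tokenize("The wind is strong in the north"))
print(tokens)
```

Using a set for the stopword list keeps each membership check constant-time, which matters when the collection is large.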
+ Stemming (word roots)
Stemming is the process of reducing the different forms of a word to their common stem or root. For example, the words "computer", "computing", and "compute" are reduced to "compute", and "walks", "walking", and "walker" are reduced to "walk". For English, the most widely used stemmer is Martin Porter's stemming algorithm.
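The idea can be illustrated with a toy suffix stripper. This is not Porter's algorithm, which applies several ordered rule phases; the suffix list, the `min_stem_len` parameter, and the name `toy_stem` are assumptions for illustration only. Stems need not be dictionary words: this sketch maps the "compute" family to the truncated form "comput".

```python
# Suffixes are tried in this order; the first one that leaves a long
# enough stem is removed. The list is deliberately small.
SUFFIXES = ("ing", "er", "es", "s", "e")


def toy_stem(word, min_stem_len=3):
    """Strip one common English suffix (toy example, not Porter's algorithm)."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word


for w in ("walks", "walking", "walker", "computer", "computing", "compute"):
    print(w, "->", toy_stem(w))
```

What matters for classification is that related word forms collapse to the same token, which shrinks the vocabulary used to represent documents.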
Thus, in this step:
- Document frequency (df): the number of documents in the collection in which a term appears.
- Input: a vector of words.
- Output: the represented document.
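A minimal sketch of computing document frequency and turning tokenized documents into term-count vectors. The sample documents and the function names `document_frequency` and `to_vectors` are illustrative assumptions, not data from the source.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): each document counts a term once
    return df


def to_vectors(docs, vocabulary):
    """Represent each document as a list of term counts over `vocabulary`."""
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append([counts[t] for t in vocabulary])
    return vectors


# Hypothetical tokenized documents, already filtered and stemmed.
docs = [["rain", "wind", "rain"], ["wind", "heat"], ["rain"]]
df = document_frequency(docs)
vocabulary = sorted(df)
vectors = to_vectors(docs, vocabulary)
```

Each row of `vectors` lines up with the sorted vocabulary, so the same index refers to the same term in every document.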
[Table: term-by-document matrix over documents d1 to d10; the term labels and cell values did not survive extraction.]
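When a new document needs to be classified, the algorithm computes the distance (Euclidean, cosine, and so on) from every document in the training set to this document in order to find the k nearest documents (called the k nearest neighbors), and then uses these values to weight all the topics. The weight of a topic is the sum of the values of the documents among the k neighbors that share that topic; a topic that does not appear among the k neighbors has weight 0. Because the topics with the highest weights are selected, the measure acts here as a similarity, so a larger value means a closer document. The topics are then sorted in decreasing order of weight, and the topics with the highest weights are chosen as the topics of the document being classified.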
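The procedure described above can be sketched as follows, assuming documents are already represented as equal-length numeric vectors and that unweighted cosine similarity is the closeness measure. The names `cosine` and `knn_classify` and the data layout are illustrative assumptions.

```python
import math
from collections import defaultdict


def cosine(x, y):
    """Unweighted cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def knn_classify(query, training, k=3):
    """Rank topics by the summed similarity of the k nearest neighbors.

    `training` is a list of (vector, topic) pairs. Returns a list of
    (topic, weight) pairs sorted by decreasing weight; topics absent
    from the k neighbors are omitted, i.e. implicitly have weight 0.
    """
    scored = sorted(
        ((cosine(query, vec), topic) for vec, topic in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    weights = defaultdict(float)
    for similarity, topic in scored[:k]:
        weights[topic] += similarity
    return sorted(weights.items(), key=lambda pair: pair[1], reverse=True)
```

The first entry of the returned ranking is the predicted topic; keeping the full ranking makes it easy to assign more than one topic to a document.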
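There are two issues to consider when classifying documents with the k-nearest neighbors algorithm: defining the notion of "near" together with a formula for measuring nearness, and finding the group of documents that best matches the given document (in other words, finding a suitable topic to assign to the document).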
The notion of "near" here means the similarity between documents. There are many ways to measure the similarity between two documents, among which the weighted cosine formula is considered effective. Let T = {t1, t2, ..., tn} be the set of terms, and W = {wt1, wt2, ..., wtn} the weight vector, where wti is the weight of term ti. Consider two documents X = {x1, x2, ..., xn} and Y = {y1, y2, ..., yn}, where xi and yi are the frequencies of term ti in documents X and Y respectively. The similarity between X and Y is then computed by the following formula:
$$\mathrm{Sim}(X, Y) = \mathrm{cosine}(X, Y, W) = \frac{\sum_{t \in T} (x_t w_t)(y_t w_t)}{\sqrt{\sum_{t \in T} (x_t w_t)^2}\,\sqrt{\sum_{t \in T} (y_t w_t)^2}}$$
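A minimal sketch of the weighted cosine formula, assuming X, Y, and W are given as equal-length lists of numbers. The function name `weighted_cosine` and the choice to return 0.0 when either weighted vector is all zeros are assumptions, since the formula itself is undefined in that case.

```python
import math


def weighted_cosine(x, y, w):
    """Weighted cosine similarity between term-frequency vectors x and y.

    Each term frequency is scaled by its term weight before the usual
    cosine is taken. Returns 0.0 if either weighted vector has zero
    length (an assumption; the formula is undefined there).
    """
    numerator = sum((xt * wt) * (yt * wt) for xt, yt, wt in zip(x, y, w))
    norm_x = math.sqrt(sum((xt * wt) ** 2 for xt, wt in zip(x, w)))
    norm_y = math.sqrt(sum((yt * wt) ** 2 for yt, wt in zip(y, w)))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return numerator / (norm_x * norm_y)
```

Setting a term's weight to 0 removes it from the comparison entirely, so two documents that differ only in zero-weight terms get similarity 1.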