
Introduction

The rapid growth of the Internet, together with major advances in storage technology, has made the volume of stored information enormous. New information is generated on the Internet every day. This huge body of text has brought considerable benefits, but it also makes searching for and synthesizing information difficult. Automatic text summarization is a solution to this problem. It is regarded as a text mining problem: applying it helps users save reading time, improves search, and makes indexing for search engines more effective.

Driven by this practical need, automatic text summarization has attracted research interest from many scientists, research groups and large companies around the world. Papers on text summarization appear frequently at well-known venues such as DUC 2001-2007, TAC 2008 and ACL 2001-2007, alongside the development of summarization systems such as MEAD, LexRank and Microsoft Word (the AutoSummarize feature).

One challenging problem that has received attention in recent years is producing a summary for a set of documents with related content, known as multi-document summarization. Multi-document summarization is regarded as a highly complex problem. Many people assume that it simply means applying single-document summarization to one document obtained by concatenating the documents of a given set. This is not correct. The greatest challenge is that the input may contain semantic ambiguity between the content of one document and another in the same set, and that the chronological order in which events are presented may differ from document to document, so producing a good summary is very difficult [EWK].

Many applications need multi-document summarization, for example automatic question answering (Q&A) systems, summarizing reports related to one event, and summarizing the data clusters returned by clustering on a search engine. Applying multi-document summarization to build automatic question answering systems has been the main research direction of the summarization community in recent years. Many studies show that using query-based multi-document summarization over a knowledge base to produce a summary that answers the user's question gives promising results, and that this is a sound approach to building automatic question answering models [Ba07, YYL07].

With the topic "Vietnamese text summarization based on an unsupervised method", we focus on studying, surveying, evaluating and proposing a multi-document summarization method suited to the Vietnamese language, and on applying this method to build a model of a Vietnamese question answering system.

Apart from the introduction and the conclusion, the report is organized into five chapters:

Chapter 1: Overview of the summarization problem. Gives an overview of automatic text summarization in general and multi-document summarization in particular, and presents some concepts and classifications of the summarization problem.

Chapter 2: Multi-document summarization based on sentence extraction. Describes in detail the approaches, challenges and open issues in solving multi-document summarization by sentence extraction.

Chapter 3: Sentence similarity and methods for strengthening the semantics of sentence similarity. Presents studies of representative methods for computing semantic sentence similarity, to be applied when extracting the important sentences of a document.

Chapter 4: Proposals for strengthening the semantics of sentence similarity and their application to a Vietnamese multi-document summarization model. Analyzes and proposes a method that integrates several algorithms to solve Vietnamese multi-document summarization, and presents how the proposed method is applied to build a simple model of a Vietnamese question answering system.

Chapter 5: Experiments and evaluation. Presents the experiments of the thesis and gives some assessments and remarks on the results obtained.

Chapter 1. Overview of the text summarization problem

1.1 The automatic text summarization problem

In 1958, Luhn of IBM presented an automatic summarization method for technical articles that used statistics on the frequency and distribution of words in the text [Lu58]. However, it was not until the final years of the 20th century, when the Internet developed, the amount of information exploded and acquiring important information became an essential need, that automatic text summarization received practical attention from many researchers. According to Inderjeet Mani, the goal of automatic text summarization is to extract content from an information source and present the most important content to the user in a condensed form and in a manner sensitive to the needs of the user or of an application [MM99].

Producing a summary of the same quality as one written by a human, without being restricted to an application domain, is considered extremely difficult. The problems addressed in text summarization therefore usually target one specific type of document or one specific type of summary.

1.2 Some concepts of the summarization problem and the classification of summaries

- Compression rate: a measure of how much information is condensed in the summary, computed as:

$$CompressionRate = \frac{SummaryLength}{SourceLength}$$

SummaryLength: length of the summary. SourceLength: length of the source document.

- Salience or relevance: the weight assigned to a piece of information in the document, expressing its importance with respect to the whole document or its relevance to the user's application.

- Coherence: a summary is coherent if all of its components follow a unified line of content and there is no duplication among the components.

Classification of summaries

There are many ways to classify text summaries, but any classification is only relative and depends on the basis chosen. Here the thesis classifies summaries on three bases: the format and content of the input, the format and content of the output, and the purpose of the summary.

Classification by input format and content answers the question "What is to be summarized?". This basis leads to several sub-classifications:

- Document type (article, news item, letter, report, ...). Under this classification, summarizing an article differs from summarizing a letter or a scientific report because of the characteristics each document type imposes.
- Document format: depending on the format, summarization divides into summarizing free-form text and summarizing structured text. For structured text, summarization usually uses a learning model based on previously built structural templates.
- Number of input documents: depending on the number of inputs, summarization divides into multi-document and single-document summarization. In single-document summarization the input is one document, whereas the input of multi-document summarization is a set of related documents, such as news items about the same event, web pages on the same topic, or a data cluster returned by a clustering process.
- Data domain: depending on whether the data belong to a specific field, such as health or education, or to a general domain, summarization can be divided into corresponding types.

Classification by purpose clarifies how the summary is produced, what it is for and whom it serves:

- Depending on the readers of the summary, summarizing for experts differs from summarizing for ordinary readers.
- A summary used in information retrieval (IR) differs from a summary used for sorting.
- By purpose, summaries can also be divided into indicative and informative summaries. An indicative summary indicates the kind of information, for example that a document is of the type "top secret". An informative summary conveys the content of the information.
- Query-based or generic summarization. The main purpose of a generic summary is to find a summary of the whole document whose content covers the entire content of that document. In query-based summarization, the content of the summary depends on the query supplied by the user or by a program. This type is commonly used to summarize the results returned by a search engine.

Classification by output also takes several forms:

- By language:
  Monolingual summarization: the system can summarize only one particular language, such as Vietnamese or English.
  Multilingual summarization: the system can summarize documents in several languages, but the output is in the same language as the input document.
  Cross-lingual summarization: the system can produce output in a language different from that of the input document.
- By the format of the output: for example a table, a paragraph or keywords.

Besides these two classifications, a commonly used output-based classification distinguishes extractive summarization (extract) from abstractive summarization (abstract).

Extractive summarization: the output is a summary made up entirely of important parts extracted from the input document.

Abstractive summarization: the output is a summary that does not reproduce the components of the input document but rewrites a new summary based on the important information.

At present, systems based on extractive summarization are more widely used and give better results than abstractive ones. The reason for this difference is that the sub-problems of abstractive summarization, such as semantic representation, inference and natural language generation, are considered hard and have produced fewer promising research results than the sentence extraction approach. In practice, according to Dragomir R. Radev (University of Michigan, USA), no abstractive summarization system has yet reached maturity, and current abstractive systems usually rely on an existing extraction component. Such systems are often known as text compaction systems.

Text compaction: a type of summarization that truncates or abbreviates the important information after it has been extracted.

Although many types of summarization arise from the different bases above, single-document and multi-document summarization remain the two types that attract the most interest from researchers in automatic summarization.

1.3 Single-document summarization

Like other summarization problems, single-document summarization is an automatic process whose input is one document and whose output is a short description of the main content of that document. A single document can be a web page, an article, or a document in a defined format (for example .doc or .txt). Single-document summarization is the stepping stone towards multi-document summarization and more complex summarization problems. For this reason, the earliest summarization methods were all methods for single documents.

Methods for single-document summarization also concentrate on two types: extractive and abstractive summarization.

Extractive summarization

Most methods of this type concentrate on extracting salient sentences or phrases from passages of the document and combining them into a summary. Early studies often used features such as the position of a sentence in the document, the frequency of words and phrases, or key phrases to compute a weight for each sentence, and then chose the sentences with the highest weights for the summary [Lu58, Ed69]. More recent techniques use machine learning and natural language processing to analyze the document and find its important components. Among machine learning approaches are the method of Kupiec, Pedersen and Chen (1995), which uses a Bayes classifier to combine features [PKC95], and the study of Lin and Hovy (1997), which applies machine learning to determine the positions of important sentences in a document [LH97]. Natural language analysis has also been applied, for example the use of the WordNet lexical network by Barzilay and Elhadad in 1997 [BE97].
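The early feature-based scoring described above (word frequency plus sentence position) can be illustrated with a minimal sketch. This is not Luhn's or Edmundson's actual algorithm: the sentence splitter, the averaging of word frequencies and the `position_bonus` weight are all simplifying assumptions chosen for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    """Rough sentence splitter on '.', '!' and '?' (an assumption, not a real tokenizer)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def score_sentences(sentences, stopwords=frozenset()):
    """Score each sentence by the average document frequency of its words,
    plus a bonus for the first sentence (a simple position feature)."""
    tokenized = [
        [w for w in re.findall(r"\w+", s.lower()) if w not in stopwords]
        for s in sentences
    ]
    freq = Counter(w for words in tokenized for w in words)
    scores = []
    for idx, words in enumerate(tokenized):
        base = sum(freq[w] for w in words) / len(words) if words else 0.0
        position_bonus = 1.0 if idx == 0 else 0.0  # hypothetical weight
        scores.append(base + position_bonus)
    return scores


def summarize(text, k=2, stopwords=frozenset()):
    """Return the k highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    scores = score_sentences(sentences, stopwords)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(text, k=2)` returns the two highest-scoring sentences in their original order. A real system for Vietnamese would also need word segmentation, which this sketch does not attempt.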

Abstractive summarization

Summarization methods that do not use extraction to create the summary can be regarded as abstractive approaches. They include approaches based on information extraction, ontologies, and information fusion and compression. Among the abstractive methods that give good results are those based on information extraction. Methods of this kind use predefined templates for an event or a storyline. The system automatically fills the information into the available template and then generates the summary. Although they give good results, such methods usually apply only within one particular domain [MR95].

1.4 Summary of Chapter 1

This chapter gave an overview of automatic text summarization, the related issues and the ways summaries are classified. The next chapter clarifies the issues of multi-document summarization in general and of multi-document summarization based on sentence extraction in particular.

Chapter 2. Multi-document summarization based on sentence extraction

2.1 Approaches to multi-document summarization

As noted above, text summarization in general and multi-document summarization in particular belong to the field of natural language processing. Natural language analysis has several depths of processing, ordered as follows: first the morphological level, then the syntactic level, then the semantic level and finally the pragmatic level. In the same way, approaches to multi-document summarization can be classified by the depth of processing carried out during summarization. For multi-document summarization, however, there are only three levels: morphological, syntactic and semantic.

Morphological level: at this level, the units compared across documents are phrases, sentences or paragraphs. Methods at this level usually use a similarity measure based on the vector space model, with TF.IDF weights applied to words and sentences. The MMR summarization method [CG98] is the prominent method at this level.

Syntactic level: at this level, comparison is based on analyzing corresponding grammatical structures across documents. Methods at this level concentrate on analyzing the grammatical structure of sentences or phrases within each paragraph of the documents. The method proposed by Barzilay and co-authors in 1999 [BME99] belongs to this level.

Semantic level: this level concentrates on analyzing entity names, the relations between entities and the events involving them, in order to determine the importance of information. The method proposed by McKeown and Radev in 1995 [MR95] is a form of summarization at this level.

Based on the characteristics of each approach, Inderjeet Mani gave a table comparing and assessing the three levels of approach to multi-document summarization [Ma01].

Processing level | Characteristics | Advantages | Disadvantages
Morphological | Makes heavy use of similarity measures between vocabulary items | Very widely used; handles redundancy well | Cannot describe other features; weak at synthesizing information
Syntactic | Compares the syntax trees of sentences or phrases in the documents | Can detect equivalent concepts in phrases; allows information to be synthesized | Cannot describe other features; requires extending the rules for comparing syntax trees
Semantic | Compares predefined document templates | Can describe many different features | Templates must be created in advance for each domain

Table 2.1. Comparison of approaches to multi-document summarization [Ma01].

2.2. Challenges of multi-document summarization

One of the greatest challenges of multi-document summarization is content ambiguity between documents. Three causes of content ambiguity are cross-document co-reference, cross-document temporal ambiguity, and overlapping content between documents.

Pronominal anaphora and co-reference

Usually an entity is first mentioned by its original name, and afterwards a pronoun is used in its place. Determining exactly which entity a pronoun refers to is called pronominal anaphora resolution. Determining that two or more mentions in different documents refer to the same entity is called cross-document co-reference resolution. This problem must be solved well for the output of multi-document summarization to be good and easy to understand.

Temporal ambiguity

The documents in a cluster may refer to time with many different words or phrases, for example "yesterday" or "today". Clearly determining the corresponding points in time is a necessary condition for arranging sentences or documents in a sensible order. Some systems can identify time points and replace relative time expressions with absolute ones by analyzing the content of the document.

To ensure that the summary produced by a multi-document summarization system is readable, three issues must be handled well: pronominal anaphora resolution, cross-document co-reference resolution and temporal ambiguity. The first two also occur in single-document summarization, but solving them there is less complex than in the multi-document case. Temporal ambiguity, in contrast, does not arise in single-document summarization, because a single input document is assumed to be properly ordered already, the order having been established by its author [Ji98]. For multi-document summarization, however, this problem becomes extremely difficult, and research on it concentrates on data that carry time information, such as news or chains of events. One method that handles this problem well was given by Barzilay, Elhadad and McKeown in 2002 [BME02]. For data sets with no clear time information, researchers assume by default that the documents are equivalent in time.

Content overlap between documents

A question many people ask about multi-document summarization is whether the documents can simply be concatenated and then summarized with a single-document summarizer. The answer is no. Doing so does not produce a good summary, because it neither removes the overlap in content nor identifies the relations between the documents.

There are many kinds of relations between documents. Dragomir Radev listed 24 types of cross-document relations [Ra00], as in Table 2.2. The relations exist at several levels: word level (W), phrase level (P), paragraph or sentence level (S), and whole-document level (D).

This is a taxonomy of cross-document relations called Cross-document Structure Theory (CST). Good use of CST is extremely helpful for identifying overlap between documents in multi-document summarization.

Compression rate

Besides content ambiguity, the compression rate is also an issue in multi-document summarization. In single-document summarization, a rate of 10% of the length of the original document can be enough for a summary. For a cluster of n documents, however, a rate of 10% gives a summary of length 0.1n times the average document length. Since n varies, the summary may become much longer than the user wants to read. For multi-document summarization, the compression rate therefore needs to take the size of the document cluster into account. In multi-document summarization based on sentence extraction, to produce a summary whose length fits the user's requirements, the compression rate is usually replaced by the number of sentences in the summary.

2.3. Evaluating summarization results

Evaluating the results of text summarization is currently a difficult task. Using the judgments of language experts is regarded as the best way to evaluate, but it is very costly. Alongside manual evaluation by experts, automatic evaluation of summaries has also received much attention. Since 2000, NIST has organized the DUC conference once a year to carry out large-scale evaluations of text summarization systems. The aim of automatic evaluation is to find a measure of summary quality that is as close as possible to human judgments.

Recall at different compression rates is a reasonable evaluation measure, although it does not reveal differences in system performance. The coverage measure is therefore computed as:

$$C = R \times E$$

Here R is the sentence recall, given by R = (number of units covered) / (total number of units in the model summary). E is the completeness ratio, which lies between 0 and 1 (1 for all, 3/4 for most, 1/2 for some, 1/4 for hardly any, 0 for none).

DUC 2002 used a length-adjusted version of the coverage measure, C':

$$C' = \alpha \cdot C + (1 - \alpha) \cdot B$$

where B is the brevity and $\alpha$ is a parameter reflecting relative importance. The labels for E were also changed to 100%, 80%, 60%, 40%, 20% and 0% respectively.

The ROUGE method

BiLingual Evaluation Understudy (BLEU) [KST02] is a method introduced by the machine translation community for automatically evaluating machine translation systems. It is fast, language independent and correlates with human judgments. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [LH03] is a method introduced by Lin and Hovy in 2003 that is based on similar ideas. It uses n-grams to measure the correspondence between the output of a summarization model and the evaluation data set. The method gives promising results and is highly regarded by the text summarization research community.
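The n-gram idea behind ROUGE can be sketched as a recall computation: the number of reference n-grams that also occur in the system summary (with clipped counts), divided by the total number of reference n-grams. This is a minimal sketch of ROUGE-N recall only. Whitespace tokenization and lowercasing are assumptions, and it is not a replacement for the official ROUGE toolkit.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=1):
    """ROUGE-N recall of a candidate summary against one or more reference summaries."""
    cand = ngrams(candidate.lower().split(), n)
    matched = 0
    total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        total += sum(ref_counts.values())
        # Clip each match by how often the n-gram occurs in the candidate.
        matched += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
    return matched / total if total else 0.0
```

For example, `rouge_n_recall("the cat sat on the mat", ["the cat is on the mat"], n=1)` counts five of the six reference unigrams as covered.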

2.4. Multi-document summarization based on sentence extraction

Multi-document summarization based on sentence extraction solves the multi-document summarization problem with a morphological-level approach. Its advantage is that it handles well the redundancy caused by overlapping content among the documents of a cluster, and it produces effective summaries. Because of this advantage, it has been widely studied, developed and used by the automatic summarization community [HMR05, FMN07, BKO07]. Although many methods have been published, most of them concentrate on two main problems:

- Identifying and removing duplicated and overlapping content between documents.
- Ranking the sentences of the documents by salience (importance) of content, or by relevance to a query supplied by the user or by a program.

2.4.1. Removing overlap and ranking documents by importance

Removing overlap and ranking the documents of a cluster by importance is one of the most important issues in multi-document summarization. A popular method for computing this importance is MMR (Maximal Marginal Relevance), proposed by Jaime Carbonell and Jade Goldstein in 1998 [CG98]. The input of the method is a cluster of documents that has already been ranked, and the output is the cluster re-ranked in semantic order. The method ranks documents with a measure that makes the semantic boundaries between the documents of the cluster explicit. A document maximizes this measure when its similarity to the query is high and its similarity to the documents already selected is low. The measure is computed as follows:

$$MMR = \arg\max_{D_i \in R \setminus S} \Big[ \lambda \cdot Sim_1(D_i, Q) - (1 - \lambda) \cdot \max_{D_j \in S} Sim_2(D_i, D_j) \Big]$$

where:

- $\lambda$: a parameter in the range [0,1] that determines the contribution of each of the two measures. If $\lambda = 1$, the importance of a document depends only on the similarity between the document and the query. If $\lambda = 0$, only the similarity between this document and the other, already selected documents contributes to the expression.
- C: the document cluster.
- $D_i$: a document in cluster C.
- Q: the query (or the question entered by the user).
- $R = IR(C, Q, \theta)$: the set of documents of C ranked by relevance to the query Q, based on a given threshold $\theta$.
- S: the set of documents of R that have already been selected.
- R\S: the set of documents of R that have not yet been selected.
- $Sim_1$, $Sim_2$: measures of similarity between two documents.

2.4.2. Sentence ranking method

Determining sentence importance is a step found in most current single-document and multi-document summarization methods. The importance measure can be built by combining several sentence similarity measures with improved variants of MMR, raising importance at the level of sentence semantics [HMR05, FMN07, BKO07]. The MMR formula adapted to the sentence level is:

$$Score = \arg\max_{s_i} \Big[ \lambda \cdot sim(s_i, q) - (1 - \lambda) \cdot \max_{s_j} sim(s_i, s_j) \Big]$$

where:

- $\lambda$: a parameter in the range [0,1] that determines the contribution of each of the two measures.
- q: the query (or the question entered by the user).
- $s_i$: a sentence in the document cluster.
- $s_j$: the other sentences in the document cluster.
- sim: a measure of similarity between two sentences.

Remarks

Both problems to be solved in multi-document summarization based on sentence extraction centre on determining the similarity between two documents in general and between two sentences in particular. In practice, the methods applied to and refined for multi-document summarization based on sentence extraction all concentrate on strengthening the semantics of the similarity measure between two sentences or two documents [HMR05, FMN07, BKO07]. Chapter 3 presents in detail the methods for strengthening the semantics of sentence similarity.

2.5. Summary of Chapter 2

This chapter described in detail the approaches to multi-document summarization, the issues it raises and some methods for addressing them. The next chapter continues by presenting methods for strengthening the semantics of the similarity between two sentences.
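The MMR selection rule from Section 2.4 can be sketched as a greedy loop: at each step, pick the unselected item whose query similarity, discounted by its highest similarity to anything already selected, is largest. This is a minimal sketch. The function name `mmr_select`, the precomputed `query_sim` mapping and the `pair_sim` callback are hypothetical interfaces chosen for illustration, and any similarity measure can be plugged in.

```python
def mmr_select(candidates, query_sim, pair_sim, lam=0.7, k=3):
    """Greedy Maximal Marginal Relevance selection.

    candidates: list of item ids (documents or sentences)
    query_sim:  dict mapping id -> similarity to the query
    pair_sim:   function (id, id) -> similarity between two items
    lam:        trade-off in [0, 1]; 1 = relevance only, 0 = diversity only
    k:          number of items to select
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def mmr_score(c):
            # Redundancy is the highest similarity to any already selected item.
            redundancy = max((pair_sim(c, s) for s in selected), default=0.0)
            return lam * query_sim[c] - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the loop reduces to plain relevance ranking. Lower values increasingly penalize items that repeat what has already been chosen, which is the redundancy-removal behaviour the chapter describes.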

Chapter 3. Sentence similarity and methods for strengthening the semantics of sentence similarity

3.1. Similarity

In mathematics, a measure is a function that assigns a "length", a "volume" or a "probability" to parts of a given set. It is an important concept in analysis and in probability theory. For example, the counting measure is defined by $\mu(S)$ = the number of elements of S.

Likeness, or similarity, is hard to measure. Similarity is a quantity (a number) that reflects the strength of the relationship between two objects or two features. This quantity usually lies in the range -1 to 1 or 0 to 1. A similarity measure can therefore be regarded as a kind of scoring function. For example, in the vector space model the cosine measure is used to compute the similarity between two documents, each represented by a vector.

3.2. Sentence similarity

The sentence similarity problem is stated as follows. Consider a document d consisting of n sentences: d = s1, s2, ..., sn. The goal is to find a value of the function S(si, sj) with $S \in (0,1)$ and i, j = 1, ..., n. The function S(si, sj) is called the similarity measure between the two sentences si and sj. The higher the value, the closer the two sentences are in meaning.

Example: consider the two sentences "Tôi là nam" ("I am male") and "Tôi là nữ" ("I am female"). Intuitively, these two sentences are quite similar.

Semantic similarity is a confidence value that reflects the semantic relationship between two sentences. In practice it is hard to obtain a highly accurate value, because meaning is fully understood only in a specific context.

3.3. Methods for computing sentence similarity

Semantic sentence similarity is widely used in natural language processing and has produced many promising results. Methods used to compute it include [SD08, LLB06, RFF05, STP06]:

- Statistical methods: the cosine measure, the Euclidean distance measure, ...
- Methods that use standard linguistic resources to find relations between words: WordNet, Brown Corpus, Penn TreeBank, ...

Sentence similarity methods that use WordNet are judged to give good results. However, WordNet supports only English, and building such a resource for other languages is expensive in cost, labour and time. Many methods have been proposed to replace WordNet for other languages. Among them, hidden topic analysis [Tu08] and the use of the Wikipedia semantic network in place of WordNet [SP06, ZG07, ZGM07] are regarded as feasible and effective options. These methods concentrate on adding semantic components that support the cosine similarity measure.

3.3.1. Computing sentence similarity with the cosine measure

In this method, sentences are represented in a vector space model. Each component of a vector refers to the corresponding word in the main term list. The main term list is obtained by preprocessing the input document. The preprocessing steps are sentence segmentation, word segmentation, part-of-speech tagging, removal of invalid sentences (those that are not real sentences) and representation of the sentences in the vector space.

The dimension of the vector space equals the number of terms in the main term list. Each element is the importance of the corresponding term in the sentence. The importance of term j is computed with TF as follows:

$$w_{i,j} = \frac{tf_{i,j}}{\sqrt{\sum_{j} tf_{i,j}^{2}}}$$

where $tf_{i,j}$ is the number of occurrences of term j in sentence i.

With the vector space as the document representation and TF as the weight, the similarity measure chosen is the cosine of the angle between the two vectors corresponding to the sentences $S_i$ and $S_k$. The vectors representing the two sentences have the form:

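$S_i = \langle w_1^i, \ldots, w_t^i \rangle$, where $w_t^i$ is the weight of the t-th term in sentence i
$S_k = \langle w_1^k, \ldots, w_t^k \rangle$, where $w_t^k$ is the weight of the t-th term in sentence k

The similarity between them is computed as:

$$Sim(S_i, S_k) = \frac{\sum_{j=1}^{t} w_j^i \, w_j^k}{\sqrt{\sum_{j=1}^{t} (w_j^i)^2 \cdot \sum_{j=1}^{t} (w_j^k)^2}}$$

The vectors representing the sentences at this point do not take semantic relations between terms into account. Synonyms are therefore not detected, and the resulting sentence similarity is not yet good. Consider, for example, the following two sentences:

S1: Cần trao đổi ý kiến kỹ trước khi lấy biểu quyết. ("Opinions should be exchanged thoroughly before taking a vote.")
S2: Hội đàm diễn ra trong bầu không khí thân mật và hiểu biết lẫn nhau. ("The talks took place in an atmosphere of friendliness and mutual understanding.")

If semantic relations between words are ignored, the two sentences share no terms and their similarity is 0. In reality, synonyms do relate sentences: the words "nhân loại" and "loài người" both mean "humankind", so two sentences that each use one of them are about the same subject. Such sentences are related to some degree, and a similarity measure of the kind above should give them a value different from 0.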
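The TF weighting and cosine measure of Section 3.3.1 can be sketched in a few lines. This is an illustrative sketch that assumes the sentences have already been segmented into tokens. The sparse dictionary representation and the function names are choices made here, not part of the method as published.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Length-normalized term-frequency vector, stored sparsely as {term: weight}."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Two sentences with no terms in common get a cosine of exactly 0 under this representation, which is the limitation that the semantic extensions in the following sections try to address.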

3.3.2. Computing sentence similarity based on hidden topics

The hidden topic approach to sentence similarity builds on recent successful work with the hidden topic model LDA (Latent Dirichlet Allocation). The basic idea is that, for each learning task, a large data set called the "universal dataset" is collected, and a learning model is built on both the training data and a rich set of hidden topics discovered from that data set [Tu08, HHM08].

Sentence similarity model using hidden topics

The general model for computing sentence similarity with hidden topics is shown below:

Figure 3.1. Computing sentence similarity with hidden topics

The purpose of using hidden topics is to strengthen the semantics of the sentences. In other words, the meanings of sentences are distinguished more clearly once hidden topics are added. First, a universal dataset is chosen and topic analysis is run on it. Topic analysis is the process of estimating parameters under the LDA model. The result is the set of topics in the universal dataset, which are called hidden topics. This process is carried out outside the model for computing sentence similarity with hidden topics.

In Figure 3.1, the input is a single document, and the preprocessing steps produce a list of sentences. Next, topics are inferred for the preprocessed sentences, giving a list of sentences enriched with hidden topics. From here, the similarity between the enriched sentences can be computed in turn.
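The final step of this pipeline, comparing two topic-enriched sentences, can be sketched as follows. The sketch assumes that topic inference has already been done elsewhere, so each sentence arrives as a list of topic weights and a list of word weights. No LDA is run here. It also assumes the two parts are combined by a linear mix with a weight in [0, 1], as this section describes for [Tu08]. The function names and the parameter `lam` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_enriched_similarity(topics_i, words_i, topics_j, words_j, lam=0.5):
    """Mix topic-part and word-part cosine similarities.

    lam = 0 ignores the hidden topics; lam = 1 uses only the hidden topics.
    """
    topic_sim = cosine(topics_i, topics_j)
    word_sim = cosine(words_i, words_j)
    return lam * topic_sim + (1 - lam) * word_sim
```

Two sentences with no words in common but similar topic distributions receive a non-zero score whenever `lam` is greater than 0, which is the effect that motivates adding hidden topics.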

Topic inference and sentence similarity

For each sentence, topic inference yields the probability distribution of topics over the sentence and the probability distribution of words over each topic. That is, for each sentence i, LDA generates a topic distribution $\theta_i$ for the sentence. For each word in the sentence, the topic index $z_{i,j}$ (for word j of sentence i) is sampled from this topic distribution. Then, based on the topic index $z_{i,j}$, the sentence is enriched by adding words. The vector corresponding to the i-th sentence has the following form [Tu08]:

$$s_i = \{ t_1, t_2, \ldots, t_K, w_1, \ldots, w_{|V|} \}$$

Here $t_i$ is the weight of the i-th topic among the K analyzed topics (K is a constant parameter of LDA), and $w_i$ is the weight of the i-th word in the vocabulary V of all sentences. Each sentence may have several topic probability distributions. For two sentences i and j, the cosine measure is used to compute the similarity between the sentences enriched with hidden topics:

$$sim_{i,j}(\text{topic-parts}) = \frac{\sum_{k=1}^{K} t_{i,k} \, t_{j,k}}{\sqrt{\sum_{k=1}^{K} t_{i,k}^{2}} \, \sqrt{\sum_{k=1}^{K} t_{j,k}^{2}}}$$

$$sim_{i,j}(\text{word-parts}) = \frac{\sum_{t=1}^{|V|} w_{i,t} \, w_{j,t}}{\sqrt{\sum_{t=1}^{|V|} w_{i,t}^{2}} \, \sqrt{\sum_{t=1}^{|V|} w_{j,t}^{2}}}$$

Finally, the two measures are combined to give the similarity between the two sentences:

$$sim(s_i, s_j) = \lambda \times sim(\text{topic-parts}) + (1 - \lambda) \times sim(\text{word-parts})$$

In this formula, $\lambda$ is a mixing constant, usually in the interval [0,1]. It determines the contribution of each of the two similarity measures. If $\lambda = 0$, the similarity between the two sentences does not use hidden topics. If $\lambda = 1$, the similarity is computed with hidden topics only [Tu08].

3.3.3. Computing sentence similarity based on Wikipedia

Introduction to the Wikipedia semantic network

Wikipedia (http://www.wikipedia.org) is an open-content encyclopedia in many languages on the Internet. It is written and built by a large number of users working together. The project began on 15 January 2001 to complement Nupedia, an encyclopedia written by experts. Wikipedia now belongs to the Wikimedia Foundation, a non-profit organization. Wikipedia currently has more than 200 language editions, of which about 100 are active. Fifteen editions have more than 50,000 articles: English, German, French, Polish, Japanese, Italian, Swedish, Dutch, Portuguese, Spanish, Chinese, Russian, Norwegian, Finnish, Esperanto and Vietnamese. In total Wikipedia has more than 4.6 million articles, including more than 1.2 million in the English edition (English Wikipedia).

Wikipedia architecture

Wikipedia pages are stored in a network structure. More precisely, Wikipedia articles are organized as a network of semantically related concepts, and the categories are organized in a hierarchical structure (taxonomy) called the Wikipedia Category Graph (WCG).

Article graph: Wikipedia articles are connected by hyperlinks, which are created as users edit the articles. If each article is regarded as a node and the links from one article to others as directed edges running from one node to others, the result is a directed graph of Wikipedia articles (right-hand side of Figure 3.2).

Figure 3.2. Relationship between the article graph and the Wikipedia category graph

Category graph: Wikipedia categories are organized like the structure of a taxonomy (left-hand side of Figure 3.2). Each category can have any number of subcategories, and each subcategory is usually defined by a hyponymy relation or a meronymy (part-whole) relation.

Example: the category vehicle has the subcategories aircraft and watercraft.

The category graph (WCG) therefore resembles a semantic network between words, similar to WordNet. The category graph is not strictly a hierarchy, because cycles still exist and some categories are not linked to any other category, but these cases are rare. According to a survey by Torsten Zesch and Iryna Gurevych [ZG07], carried out in May 2006 on the German Wikipedia, the largest connected component of the category graph contains 99.8% of the category nodes, and only 7 cycles exist.

Similarity between concepts in the Wikipedia semantic network

Methods for computing the similarity between concepts in the Wikipedia semantic network have been put forward in several studies, such as Ponzetto and colleagues in 2006 and 2007 [SP06, PSM07] and Torsten Zesch and colleagues in 2007 [ZG07, ZGM07]. These studies concentrate on applying and refining common word similarity measures defined on WordNet so that they compute similarity between concepts in the Wikipedia semantic network. As on WordNet, the measures fall into two groups: path-based measures, such as Path Length (PL, 1989), Leacock & Chodorow (LC, 1998) and Wu and Palmer (WP, 1994) [ZG07, SP06], and information-content-based measures, such as Resnik (Res, 1995), Jiang and Conrath (JC, 1997) and Lin (Lin, 1998) [ZG07]. Except for distance measures such as Path Length, where a smaller value means higher similarity, a larger value computed between two concepts means higher similarity.

Path Length measure (PL)

The PL measure, proposed by Rada and colleagues in 1989, uses the length of the shortest path between two concepts in the graph (counted as the number of edges between them) to express their semantic closeness. In its standard form:

$$dist_{PL}(n_1, n_2) = l(n_1, n_2)$$

- $n_1$, $n_2$: the two concepts being compared
- $l(n_1, n_2)$: the shortest path length between the two concepts

Leacock & Chodorow measure (LC)

The LC measure, proposed by Leacock and Chodorow in 1998, normalizes the path length between two nodes by the depth of the graph. In its standard form:

$$sim_{LC}(n_1, n_2) = -\log \frac{l(n_1, n_2)}{2 \times depth}$$

- $n_1$, $n_2$: the two concepts being compared
- depth: the maximum depth of the graph
- $l(n_1, n_2)$: the shortest path length between the two concepts

Wu and Palmer measure (WP)

The WP measure was proposed by Wu and Palmer in 1994. In its standard form:

$$sim_{WP}(c_1, c_2) = \frac{2 \times depth(lcs)}{l(c_1, lcs) + l(c_2, lcs) + 2 \times depth(lcs)}$$

- $c_1$, $c_2$: the two concepts being compared
- lcs: the lowest concept in the is-a hierarchy that is a parent of both concepts
- depth(lcs): the depth of that parent concept

Resnik measure (Res)

The Resnik measure was proposed by Resnik in 1995. It takes the semantic similarity between two concepts to be the information content of their nearest common parent node:

$$sim_{Res}(c_1, c_2) = IC(lcs(c_1, c_2))$$
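The three path-based measures (PL, LC and WP) can be illustrated on a toy taxonomy. The sketch below makes several assumptions that are not in the source: the taxonomy is a small tree given as a child-to-parent map (the real Wikipedia category graph is not a tree and would need a breadth-first search), depth is counted in nodes with the root at depth 1, and the formulas are the commonly published forms of each measure. The category names extend the vehicle example from the text, and `entity` and `helicopter` are hypothetical additions.

```python
import math

# Hypothetical toy taxonomy: child -> parent. "entity" is the root.
PARENT = {
    "vehicle": "entity",
    "aircraft": "vehicle",
    "watercraft": "vehicle",
    "helicopter": "aircraft",
}


def ancestors(node):
    """Chain [node, parent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(node):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(ancestors(node))


def lcs(n1, n2):
    """Lowest common subsumer: the deepest node that is an ancestor of both."""
    seen = set(ancestors(n1))
    for candidate in ancestors(n2):
        if candidate in seen:
            return candidate
    raise ValueError("concepts share no ancestor")


def path_length(n1, n2):
    """PL: number of edges on the path between n1 and n2 (via their lcs in a tree)."""
    common = depth(lcs(n1, n2))
    return (depth(n1) - common) + (depth(n2) - common)


MAX_DEPTH = max(depth(n) for n in set(PARENT) | set(PARENT.values()))


def lc_similarity(n1, n2):
    """LC: -log(path length / (2 * maximum depth)); infinite for identical nodes."""
    length = path_length(n1, n2)
    if length == 0:
        return math.inf
    return -math.log(length / (2 * MAX_DEPTH))


def wp_similarity(n1, n2):
    """WP: 2*depth(lcs) / (l(n1,lcs) + l(n2,lcs) + 2*depth(lcs))."""
    common = depth(lcs(n1, n2))
    return 2 * common / ((depth(n1) - common) + (depth(n2) - common) + 2 * common)
```

In this toy tree, `aircraft` and `watercraft` are two edges apart through their shared parent `vehicle`. PL is a distance, so smaller is closer, whereas LC and WP grow as the concepts get closer.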

Vi c1, c2: l hai khi nim cn tnh ton v ic c tnh nh cng thc di: - hypo(n) l s cc khi nim c quan h thng h vi (hyponym) vi khi nim n v C l tng s cc khi nim c trn cy ch o JC c Jiang v Conrath xut nm 1997: - n1, n2: l hai khi nim cn tnh ton - IC c tnh nh cng thc trn o Lin c Lin xut nm 1998: - n1, n2: l hai khi nim cn tnh ton - IC c tnh nh cng thc trn 28 tng ng cu da vo mng ng ngha Wikipedia Do cc gi tr tng ng c nu trn u khng b rng buc bi khong 0,1, trong khi vic tnh tng ng cu theo phng php cosine i hi cc thnh phn thuc khong ny. Vo nm 2006, Li v cng s [LLB06] a ra hai cng thc ci tin tng ng t m khng lm mt tnh n iu. - i vi o PL, f l mt hm n iu gim, v vy: - i vi cc o khc, f l mt hm n iu tng, v vy: Trong hai hm s trn, v l hai tham s c chn l =0.2 v =0.45 Sau khi tnh c tng t t, ta a ra c vector ng ngha si cho mi cu. Gi tr ca tng thnh phn c trong vector l gi tr cao nht v tng t t gia t trong tp t chung tng ng vi thnh phn ca vector vi mi t trong cu

The semantic similarity between two sentences is the cosine coefficient between their two vectors:

$$S_s = \frac{s_1 \cdot s_2}{\|s_1\| \cdot \|s_2\|}$$

3.4. Summary of Chapter 3
This chapter introduced the notion of sentence similarity, methods for constructing a sentence similarity measure, and several ways to strengthen the semantics of that measure. The next chapter details the author's proposals for computing sentence similarity in Vietnamese and the Vietnamese multi-document summarization model.

Chapter 4. Proposals for strengthening the semantics of sentence similarity and their application to a Vietnamese multi-document summarization model

4.1. Proposals for strengthening the semantics of Vietnamese sentence similarity
Building highly accurate semantic similarity measures usually requires linguistic resources that capture the semantic relations among words, concepts or entities, such as Wordnet or the Brown Corpus. For Vietnamese natural language processing, however, such resources have not yet been fully built. Finding ways to build comparable resources at the lowest possible cost has therefore become an open problem for the Vietnamese natural language processing community. Alongside applying to Vietnamese the two methods mentioned in Sections 3.3.2 and 3.3.4, namely hidden topic analysis and construction of a Wikipedia semantic network, the author also studies and proposes a method for building a graph of relations between entities, using semi-supervised bootstrapping on search engines.

4.1.1. The entity graph and the model for building the entity relation graph
The Semantic Web and entity search are major topics that attract many researchers. One problem currently receiving attention is how to start from a set of entities, concepts or domain terms and expand it into a larger, more complete set of entities, concepts or domain terms that are semantically similar to the original set.
Example: in Figure 4.1, the entity expansion task is to find new relations and new entities starting from the existing ones, such as the relations Lăng Bác - Bác Hồ, Lăng Bác - Hồ Chí Minh, Lăng Bác - Quảng trường Ba Đình, and Hà Nội - Quảng trường Ba Đình.

Figure 4.1. Expanding relations and finding related entities
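The entity expansion example can be made concrete with a small undirected graph. The sketch below is an illustration only: it stores the four relations named in the example as edges and expands outward from one entity by breadth-first search; the function names and the hop limit are assumptions, not part of the thesis.

```python
from collections import defaultdict, deque

# The four relations named in the example, stored as undirected edges.
RELATIONS = [
    ("Lăng Bác", "Bác Hồ"),
    ("Lăng Bác", "Hồ Chí Minh"),
    ("Lăng Bác", "Quảng trường Ba Đình"),
    ("Hà Nội", "Quảng trường Ba Đình"),
]

def build_graph(relations):
    """Adjacency sets: each relation becomes an edge between two entity nodes."""
    graph = defaultdict(set)
    for a, b in relations:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def expand(graph, seed, max_hops):
    """Entities reachable from seed within max_hops edges, with their distances."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    dist.pop(seed)
    return dist

if __name__ == "__main__":
    g = build_graph(RELATIONS)
    print(expand(g, "Hà Nội", 2))
```

Starting from "Hà Nội", one hop reaches "Quảng trường Ba Đình" and two hops reach "Lăng Bác", which is the kind of indirect association the expanded graph is meant to expose.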

[Figure 4.1: example entity graph whose nodes include Hà Nội, Hồ Gươm, Hà Thành, Hồ Tây, Lăng Bác, Bác Hồ, Hồ Chí Minh and Quảng trường Ba Đình]

Drawing on the idea of entity expansion and on our survey of the two semantic networks Wordnet and Wikipedia, we are interested in building a graph that expresses the relations among entities and in using this graph as a semantic network for constructing a semantic sentence similarity measure. Each relation between two entities is treated as an edge that directly connects two entity nodes. Based on two studies, the search-engine-based entity set expansion of R. Wang and W. Cohen from 2007 [WC07] and the search-engine-based concept similarity measure proposed by Bollegala in 2006 [BMI06], we propose a model that builds the entity relation graph from search engines using the semi-supervised bootstrapping algorithm. The model we propose for building the entity relation graph from search engines is as follows:

Figure 4.2. Model for building the entity relation graph
[Figure 4.2 components: seed entity list; query; (1) Google/Yahoo search engines; snippet list; (2) entity recognition; weighted entity list E1 ... Ek; (3) entity ranking and relation generation; entity relation graph]

The model for building the entity relation graph has three main phases:

Search engine interaction phase (Google/Yahoo): A number of entities from the entity relation graph are placed in the seed entity list. This phase takes as input a query drawn from the seed set and submits it to the search engines, for example "Hà Nội" or "Hồ Gươm". Search engines such as Google and Yahoo return the snippets that match the submitted queries.

Named entity recognition (NER) phase: In this phase the snippets are passed through a named entity recognition tool to detect new entities that appear in them. The recognition tool plays an important role in building the entity relation graph. For English there are quite a few tools that use machine learning to recognize entity names with high accuracy, such as Lingpipe API (1) and OpenNLP (2). For Vietnamese no such tool exists yet, so the author uses a number of regular-expression rules, for example selecting character sequences in which every word is capitalized and whose length is greater than two words. Once the set of new entity names is available, the phase counts how often each name occurs.
(1) Lingpipe API. http://alias-i.com/lingpipe
(2) OpenNLP. http://opennlp.sourceforge.net

Entity ranking and relation generation phase: In this phase the new entity names are sorted by frequency. Using a predefined selection threshold, the phase picks the entity names whose frequency exceeds the threshold and pairs each of them with the input entity to form a relation. The new entities and relations are added to the existing graph stored in the database.

The model is repeated until no new relation is generated. The seed entities for the first iteration are entered by hand. Entities that have already been submitted to the search engine query phase are marked so that they are not submitted again.

4.1.2. Semantic sentence similarity based on the entity relation graph
By studying the correspondence between the entity relation graph proposed by the author and the two semantic networks Wordnet and Wikipedia, together with the similarity measures used on those two networks and described in Section 3.3.3, we propose a semantic similarity measure based on the entity graph.

Correspondence between the entity relation graph and the Wordnet and Wikipedia semantic networks

Property | Wordnet | Wikipedia | Entity graph
Graph of relations among concepts | Yes | Yes | Yes
Topic hierarchy tree | Yes | Yes | No
Information content at concepts | Yes | Yes | No
Types of relation among concepts | Most relations between two words/entities/concepts | Hypernym-hyponym, part-whole and similarity relations | Similarity relation
Language | English | 265 languages | English, Vietnamese

Table 4.1. Correspondence among the entity relation graph, Wordnet and Wikipedia

Semantic similarity based on the entity relation graph
From the comparison in Table 4.1, we observe that a semantic similarity measure on the entity relation graph can only use the group of measures based on the distance between concepts (path length measures). The entity similarity measure we propose is based on the LC (Leacock & Chodorow) measure presented in Chapter 3, where:
- n1, n2: the two entities on the graph being compared
- depth: the maximum length on the graph, measured from the seed entities at system initialization to the entity (node) that is farthest from them
- l(n1,n2): the shortest distance between the two entities
The sentence similarity formula of Li et al. from 2006 [LLB06], given in Section 3.3.3, is then applied to build the sentence similarity measure on the entity relation graph.

Remarks: The entity relation graph carries little information at each entity node and has no topic classification for its entities. Even so, it is an automatic method that reduces the cost of building a corpus and can produce a graph with a large number of entity nodes that grows quickly. Sentence similarity based on the entity relation graph is limited to distance measures, but it can easily be combined with other semantic similarity measures through mixing functions over the measures.

4.2. Semantic sentence similarity for Vietnamese
To build a good semantic similarity measure, the common approach is to combine several measures through a linear ranking function. The combination is expressed as:

$$\mathrm{SimTotal}(s_1, s_2) = \sum_i \lambda_i \cdot \mathrm{sim}_i(s_1, s_2)$$

subject to:

$$\sum_i \lambda_i = 1$$

where:
- s1, s2: the two sentences whose similarity is computed
- i: ranges over the similarity measures being combined
- sim_i: the component similarity measures
- λ_i: mixing constants in [0, 1] that express the contribution of each component measure to SimTotal. These parameters must sum to 1 over all constants in the formula (they are estimated during the experiments).

The measures below are used in the evaluation to find the semantic similarity measure best suited to Vietnamese. Among them, measures 5 and 6 are combined measures.

No. | Measure | Description | Mixing constants chosen experimentally
1 | Cosine [Cos] | Cosine similarity |
2 | Hidden topic [Hidden] | Similarity based on hidden topics combined with cosine | Cos=0.6, Hidden=0.4
3 | Wikipedia [Wiki] | Similarity based on the Wikipedia semantic network |
4 | Entity Graph [EntG] | Similarity based on the entity relation graph |
5 | Hidden topic & Wikipedia & Entity Graph [All_1] | Combination of the three measures 1, 2, 3 | Cos=0.3, Hidden=0.3, Wiki=0.2, EntG=0.2
6 | Hidden topic & Wikipedia & Entity Graph & Dictionary [All_2] | Combination of the three measures 1, 2, 3 and similarity based on a synonym dictionary | Cos=0.3, Hidden=0.2, Wiki=0.2, EntG=0.2, Dictionary=0.1

Table 4.2. List of semantic sentence similarity measures

4.3. Vietnamese multi-document summarization model
Building on the work described in the preceding sections, the author proposes a multi-document summarization model for clusters of Vietnamese web pages returned by a search engine.

Figure 4.3. Vietnamese multi-document summarization model
[Figure 4.3 components: cluster returned by the search engine; (1) preprocessing; list of documents, list of sentences and cluster label; (2) ranking documents and sentences by importance; weighted sentences S1 ... Sk and weighted documents D1 ... Dk; (3) summary generation; summary text]

The model takes as input clusters of Vietnamese web pages produced by search-result clustering. Each cluster has a cluster label and a set of web pages whose content relates to that label. Each web page is treated as one document. The summarization model has three main phases:

Data preprocessing phase
This phase takes as input the set of web pages in one cluster and performs the following steps:
- Remove web pages with duplicate content.
- Filter noise, strip HTML tags and extract the main content of each web page.
- Segment the resulting text into words and sentences with the JVnTextpro tool by Nguyễn Cẩm Tú.
- Segment the cluster label into words.

Document and sentence ranking phase
This phase takes as input the preprocessed documents and cluster label, and outputs the lists of sentences and documents ranked by semantic importance. Ranking documents and sentences by importance, together with removing overlap between documents, is an important step in multi-document summarization. In this model, the method used to rank documents and sentences combines the approaches described in Sections 2.4.1 and 2.4.2 with the semantic similarity measures described in Section 4.2.
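The linear combination of Section 4.2, on which this ranking step relies, can be sketched as a weighted sum. The mixing constants below are the ones listed in Table 4.2; the component scores are made-up numbers for one hypothetical sentence pair, and the function name is an assumption.

```python
# Mixing constants as listed in Table 4.2 of the text.
WEIGHTS = {
    "Hidden": {"Cos": 0.6, "Hidden": 0.4},
    "All_1": {"Cos": 0.3, "Hidden": 0.3, "Wiki": 0.2, "EntG": 0.2},
    "All_2": {"Cos": 0.3, "Hidden": 0.2, "Wiki": 0.2, "EntG": 0.2, "Dictionary": 0.1},
}

def sim_total(component_scores, weights, tol=1e-9):
    """Weighted sum of component similarities; weights must lie in [0, 1] and sum to 1."""
    if any(w < 0.0 or w > 1.0 for w in weights.values()):
        raise ValueError("each mixing constant must lie in [0, 1]")
    if abs(sum(weights.values()) - 1.0) > tol:
        raise ValueError("mixing constants must sum to 1")
    return sum(w * component_scores[name] for name, w in weights.items())

# Hypothetical component similarities for one sentence pair (illustration only).
EXAMPLE_SCORES = {"Cos": 0.5, "Hidden": 0.8, "Wiki": 0.4, "EntG": 0.6, "Dictionary": 1.0}

if __name__ == "__main__":
    for name, weights in WEIGHTS.items():
        print(name, round(sim_total(EXAMPLE_SCORES, weights), 2))
```

Keeping the constants in one table makes it easy to swap in re-estimated values, since the text notes that they are tuned experimentally.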

Summary generation phase
In the summary generation phase, the sentences ranked in the previous phase are ranked again. The importance weight of each sentence is augmented with the weight of the document that contains it, which helps the summary avoid overlapping content. ScoreTotal is the formula that recomputes the importance of a sentence:

$$\mathrm{ScoreTotal}(s_k) = \alpha \cdot \mathrm{Score}(s_k) + (1 - \alpha) \cdot \mathrm{Score}(D_i), \quad s_k \in D_i$$

- s_k: the sentence whose importance is computed.
- D_i: the document that contains s_k.
- Score(s_k), Score(D_i): the importance weights of s_k and D_i computed in the previous phase.
- α: the mixing constant in [0, 1] that expresses the contributions of the two measures Score(s_k) and Score(D_i) (it is estimated during the experiments).

Once sentence importance is available, the sentences are sorted in descending order of ScoreTotal, and the most important sentences are extracted according to a given ratio. The extracted sentences are then arranged into a text with the following priorities:
- Sentences from documents with a higher Score(D_i) are placed earlier in the text.
- Within the same document, sentences keep their original top-to-bottom order.

4.4. Vietnamese question answering model applying multi-document summarization
One topic of interest to the multi-document summarization community is using summarization to build automatic question answering systems. These studies use multi-document summarization to find answers in a background knowledge collection. Besides producing answer texts, they also make the evaluation of summarization models easier and more objective: instead of requiring linguists to judge the accuracy of the generated summaries, evaluation reduces to checking whether the answer correctly answers the input question.
Surveying the results returned by search engines such as Google and Yahoo for a number of natural-language questions, the author observed that the answers are present in the returned snippets or web pages. Based on this observation, the author proposes a Vietnamese question answering model that applies multi-document summarization to the results returned by search engines in order to find the answer.

Figure 4.4. Vietnamese question answering model applying multi-document summarization
[Figure 4.4 components: natural-language question; (1) interaction with the Google/Yahoo search engines; list of snippets/web pages; (2) preprocessing; list of sentences/documents; (3) multi-document summarization; answer to the question]

The Vietnamese question answering model has three main phases:

Search engine interaction phase: This phase receives the user's natural-language question, segments it into words and converts it into a query that is submitted to the Google and Yahoo search engines. The Vietnamese snippets and web pages returned by the search engines are downloaded and passed to the preprocessing phase.

Preprocessing phase: The steps in this phase are:
- Filter noise, strip HTML tags and extract the main content of each web page.
- Segment the text obtained from the web pages and snippets into words and sentences.

Multi-document summarization: This phase uses the Vietnamese multi-document summarization model of Section 4.3, with the natural-language question treated as the cluster label and the documents extracted from the preprocessed web pages treated as the cluster. The output of the summarization model is the sentence with the highest weight after ranking, which is taken as the answer to the question.

4.5. Summary of Chapter 4
This chapter presented the author's proposals for building a semantic sentence similarity measure for Vietnamese, the multi-document summarization model, and the question answering model that applies multi-document summarization. The next chapter presents experiments that demonstrate the feasibility and potential of Vietnamese multi-document summarization and of the Vietnamese question answering system model.

Chapter 5. Experiments and evaluation

5.1. Experimental environment
The experiments were run on a computer with the following configuration:
- CPU: Intel Core 2 Duo 2.53 GHz x 2
- RAM: 3 GB
- Operating system: Windows Vista
- Development software: MyEclipse 7.5, Java 1.6

The software tools and open-source packages used are listed in the table below:

No. | Software | Description
1 | JSum | Author: Trần Mai Vũ. Purpose: a tool with two main groups of functions: building the Wikipedia semantic network and the entity relation graph; multi-document summarization based on semantic similarity measures such as hidden topic inference, the Wikipedia semantic network, the entity graph and ontology
2 | VQA | Authors: Trần Mai Vũ and Nguyễn Đức Vinh. Purpose: Vietnamese question answering system based on two methods: multi-document summarization and semantic relation extraction [VVU09]
3 | JVnTextpro | Author: Nguyễn Cẩm Tú. Purpose: word and sentence segmentation for Vietnamese text
4 | JGibbsLDA | Author: Nguyễn Cẩm Tú. Purpose: building and analyzing hidden topics
5 | Mulgara | Author: Northrop Grumman Corporation. Website: http://www.mulgara.org. Purpose: storing the Wikipedia semantic network and the entity relation graph on semantic web technology
6 | Lingpipe | Author: Alias-i. Website: http://alias-i.com/lingpipe. Purpose: named entity recognition (NER) for English

Table 5.1. Software tools used in the experiments

5.2. Experimental procedure

5.2.1. Hidden topic analysis experiment
Hidden topic analysis data: the 125-topic data set (vnexp-lda4-125topics), analyzed with JGibbsLDA on a corpus of news articles collected from the Vnexpress website.
After hidden topic analysis, each sentence is assigned to topics from the predefined set in the hidden topic data. Example:

No. | Sentence | Topics in the sentence
1 | Tax cuts | Topic_48 Topic_97
2 | Continued tax reductions on many imported goods | Topic_97
3 | Goods subject to tax cuts in the coming period include liquor, beer, tobacco, coffee, vegetable oil, processed meat... | Topic_16 Topic_33 Topic_54 Topic_62 Topic_97 Topic_106 Topic_123
4 | At the Government's request, the Ministries of Finance and of Industry and Trade continue the roadmap toward market pricing for strategic goods under State control, in order to encourage competition and limit monopoly. | Topic_13 Topic_33 Topic_41 Topic_47 Topic_67 Topic_78 Topic_105 Topic_115 Topic_122

Table 5.3. Hidden topic analysis results

The sentences above whose content relates to the topic "Thuế" (tax) all contain Topic_97 after topic analysis. The 20 words with the highest probability in Topic_97 are:

Topic 97:
1. thương_mại (trade) 0.051798
2. wto 0.038748
3. đàm_phán (negotiation) 0.028651
4. gia_nhập (accession) 0.021578
5. thành_viên (member) 0.017416
6. nhập_khẩu (import) 0.015039
7. cam_kết (commitment) 0.014520
8. thuế (tax) 0.013109
9. xuất_khẩu (export) 0.011164
10. vấn_đề (issue) 0.010848
11. kinh_tế (economy) 0.010271
12. hiệp_định (agreement) 0.010070
13. phát_triển (development) 0.009695
14. tự_do (free) 0.009162
15. tổ_chức (organization) 0.007909
16. dt 0.007175
17. asean 0.007131
18. t 0.007117
19. bộ_trưởng (minister) 0.006872
20. nông_nghiệp (agriculture) 0.006757

Table 5.4. The 20 highest-probability words in hidden topic 97

5.2.2. Entity relation graph construction experiment
Data for building the entity relation graph:
Seed data: 200 Vietnamese entities and 200 English entities in the domains of place names, organizations and people.
The experiment runs an implementation of the entity relation graph construction model proposed in Section 4.1.1. In this experiment the graph is built for two languages, English and Vietnamese. The named entity recognition (NER) methods applied in the model are:
For English: a CRF machine learning model, using the Lingpipe API toolkit.
For Vietnamese: regular expressions.

Language | Entities obtained | Relations | Running time
English | 48,365 entities | 72,619 relations | 5 days
Vietnamese | 21,693 entities | 32,774 relations | 5 days

Table 5.5. Data obtained by the entity relation graph construction model

5.2.3. Similarity measure evaluation experiment
Wikipedia data: 99,679 articles from Vietnamese Wikipedia (23/10/2009), downloaded from: http://download.wikimedia.org/viwiki/20091023
Dictionary data: a synonym dictionary of 2,393 synonym groups, developed from the synonym dictionary of Nguyễn Văn Tu, Nhà xuất bản Đại học và Trung học chuyên nghiệp, 1985.
Evaluation data for semantic sentence similarity: 20 clusters, each containing 3-5 sentence pairs that were ranked by hand according to semantic similarity (the lower the rank, the higher the similarity). Example:

No. | First sentence | Second sentence | Manual rank
1 | Tôi thích Hà Nội (I like Hanoi) | Anh yêu Hồ Gươm (He loves Hồ Gươm) | 1
2 | Tôi thích Hà Nội | Em mến người Hà Thành (I am fond of the people of Hà Thành) | 2
3 | Tôi thích Hà Nội | Cô ấy ngắm nhìn Tháp rùa (She gazes at Tháp Rùa) | 3
4 | Tôi thích Hà Nội | Bạn ấy thích Hà Giang (That friend likes Hà Giang) | 4

Table 5.6. A data cluster used to evaluate semantic similarity

In this experiment, the similarity measures evaluated are those listed in Table 4.2.
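To make the example cluster concrete, the sketch below ranks its four candidate sentences against "Tôi thích Hà Nội" with a plain bag-of-words cosine. It is an illustration under stated assumptions: the sentences are written with their diacritics restored, tokens are obtained by splitting on whitespace (syllables, not the segmented words a tool such as JVnTextpro would produce), and tied scores share a rank.

```python
import math
from collections import Counter

REFERENCE = "Tôi thích Hà Nội"
CANDIDATES = [
    "Anh yêu Hồ Gươm",
    "Em mến người Hà Thành",
    "Cô ấy ngắm nhìn Tháp rùa",
    "Bạn ấy thích Hà Giang",
]

def cosine(a, b):
    """Cosine similarity over whitespace-token counts (an assumed tokenization)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def dense_ranks(scores):
    """Rank 1 goes to the highest score; equal scores share the same rank."""
    ordered = sorted(set(scores), reverse=True)
    return [ordered.index(s) + 1 for s in scores]

scores = [cosine(REFERENCE, c) for c in CANDIDATES]
ranks = dense_ranks(scores)

if __name__ == "__main__":
    for sentence, score, rank in zip(CANDIDATES, scores, ranks):
        print(rank, round(score, 3), sentence)
    # ranks == [3, 2, 3, 1]
```

Purely lexical cosine puts "Bạn ấy thích Hà Giang" first because it shares two tokens with the reference, although the manual ranking places it last. This gap is what the semantic measures of Table 4.2 are meant to close.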

Experimental steps:
- Compute the similarity between the sentence pairs with each measure, and rank the pairs so that semantically closer pairs receive lower rank numbers.
- Accuracy is computed as the number of sentences that keep the rank assigned by hand in the experimental data set.

Sentence No. | Cos | EntG | Wiki | Hidden | All_1 | All_2
1 | 3 | 2 | 2 | 2 | 2 | 1
2 | 2 | 3 | 1 | 1 | 1 | 2
3 | 3 | 4 | 4 | 4 | 3 | 3
4 | 1 | 1 | 3 | 3 | 4 | 4

Table 5.7. Evaluation results of the measures on the data cluster of Table 5.2

For the evaluation on 10 English clusters, the author used only two similarity measures: Cosine and the entity relation graph.

Language | Cos | Hidden | Wiki | EntG | All_1 | All_2
Vietnamese | 56% | 72% | 76% | 69% | 81% | 89%
English | 68% | ~ | ~ | 83% | ~ | ~

Table 5.8. Accuracy on 20 Vietnamese data clusters and 10 English clusters

The experimental results show that the All_2 semantic similarity measure performs better than the other measures. In the following experiments the author uses All_2 as the main semantic similarity measure.

5.2.4. Experiment evaluating the accuracy of the multi-document summarization model
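The accuracy used in Section 5.2.3, and reused in this experiment for documents and sentences, counts the items whose automatic rank equals the manual rank. A minimal sketch follows, using the rank columns as read from Table 5.7 and the manual ranks 1 to 4 of Table 5.6; dividing by the number of items to obtain a fraction is an assumption about how the percentages are formed.

```python
MANUAL = [1, 2, 3, 4]  # manual ranks of sentences 1-4 in Table 5.6

# Automatic ranks per measure, one value per sentence, as read from Table 5.7.
TABLE_5_7 = {
    "Cos":    [3, 2, 3, 1],
    "EntG":   [2, 3, 4, 1],
    "Wiki":   [2, 1, 4, 3],
    "Hidden": [2, 1, 4, 3],
    "All_1":  [2, 1, 3, 4],
    "All_2":  [1, 2, 3, 4],
}

def rank_accuracy(predicted, manual):
    """Fraction of items whose predicted rank equals the manual rank."""
    if len(predicted) != len(manual):
        raise ValueError("rank lists must have the same length")
    hits = sum(1 for p, m in zip(predicted, manual) if p == m)
    return hits / len(manual)

accuracy = {name: rank_accuracy(r, MANUAL) for name, r in TABLE_5_7.items()}

if __name__ == "__main__":
    for name, value in accuracy.items():
        print(f"{name}: {value:.0%}")
```

On this single cluster All_2 reproduces the manual order exactly, while Cos and All_1 each match two of the four positions; the figures in Table 5.8 aggregate over all clusters.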

Evaluation data for the multi-document summarization model: 5 clusters returned by search-result clustering on the Vietnamese search engine VnSen, each containing 8-10 documents. The documents in each cluster and the 20 most important sentences are ranked by hand according to the similarity between each document or sentence and the cluster label. Accuracy is computed as the number of documents or sentences that keep the rank assigned by hand in the experimental data set.

Cluster | Documents | Sentences | Cluster label | Document order accuracy | Order accuracy of the 20 most important sentences
1 | 10 | 216 | Lãi suất tiết kiệm (savings interest rates) | 80% | 80%
2 | 8 | 116 | Cắt giảm thuế (tax cuts) | 87.5% | 85%
3 | 8 | 127 | Công cụ tìm kiếm Google (Google search engine) | 87.5% | 80%
4 | 8 | 101 | Laptop giá rẻ (cheap laptops) | 75% | 75%
5 | 8 | 86 | Dịch tiêu chảy (diarrhea epidemic) | 75% | 70%

Table 5.9. Evaluation of document order and of the order of the 20 most important sentences

For the cluster labeled "Lãi suất tiết kiệm", with an extraction size of 10 sentences, the returned summary is fairly good on visual inspection.

Summary text
[8][7] Yesterday, Dong A Bank announced an increase in VND savings deposit rates for individual customers, with an average rise of 0.06% per month.
[9][2] "Bank interest rates are high. Everyone wants to sell off their shares to put the money into savings but cannot; I had to struggle to sell mine," Ms. Phuc said with a cheerful laugh.
[1][1] Savings interest rates at 15%
[10][1] Rushing to the banks to make short-term deposits
[10][25] However, many banks also predict that short-term deposits will dominate over long-term savings.
[10][4] Meanwhile, at Phương Đông Bank, Ms. Linh had 70 million dong ready since the weekend for a 12-month flexible savings deposit.

[2][23] An executive of VP bank commented: This week there will be many interest rate movements, because the banks watch each other's moves in order to adjust their rates in time. Only in this way can they retain customers.
[7][19] Each month the company pays the bank nearly 10 million dong in interest.
[7][11] Banks' lending rates are being adjusted, and together with some banks suspending lending this has an immediate impact on businesses that need to borrow at this time.
[7][1] Caught in a bind because banks adjust lending

Table 5.10. Summary returned for an extraction size of 10 sentences (the two numbers at the start of each line are the position of the document in the cluster and the position of the sentence in the document).

5.2.5. Experiment evaluating the accuracy of the question answering model
Evaluation data for the question answering system: 500 translated questions, selected and edited from the TREC data set (taken from the OpenEphyra toolkit). The questions were checked beforehand on the search engines to see whether an answer appears in the returned snippets.

Similarity measure | Correct answers | Accuracy | Average answer time
Cos | 67 | 13.4% | 30 seconds
Hidden | 238 | 47.6% | 2 minutes
Wiki | 142 | 28.4% | 25 minutes
EntG | 167 | 33.4% | 15 minutes
All_1 | 318 | 63.6% | 35 minutes
All_2 | 376 | 75.2% | 40 minutes

Table 5.11. Accuracy of the question answering model based on multi-document summarization, on snippets

Similarity measure | Correct answers | Accuracy | Average answer time
Cos | 101 | 21.6% | 2 minutes
Hidden | 356 | 71.2% | 15 minutes
Wiki | 104 | 20.8% | 45 minutes
EntG | 125 | 25.0% | 1 hour 15 minutes
All_1 | 359 | 71.8% | 2 hours 30 minutes
All_2 | 389 | 77.8% | 3 hours
*The times above do not include the time needed to download the web pages.

Table 5.12. Accuracy of the question answering model based on multi-document summarization, on web pages

Question | Answer
Who was the first person to discover America? | Everyone knows Columbus was the first person to discover America
Which composer wrote the song "Người Hà Nội"? | "Người Hà Nội" is a song composed by Nguyễn Đình Thi

What effect do tomatoes have on health? | Tomatoes help prevent breast cancer and stomach cancer
In what year did Uncle Ho go to France? | In the summer of 1911, Uncle Ho set foot on French soil, for Uncle Ho
Who founded Google? | The Financial Times voted the two co-founders of the Google search engine, Sergey Brin and Larry Page, both 32 years old, Man of the Year

Table 5.13. Sample answers returned by the question answering system

Conclusion

Problems addressed in the thesis
The thesis studies Vietnamese multi-document summarization based on sentence extraction. This problem is highly complex and underlies many practical applications. The approach taken focuses on strengthening the semantics of the similarity measure between two sentences during the extraction of important sentences from the input data.
Building on work on hidden topics, the Wikipedia semantic network and a method proposed by the author, the thesis presents a semantic sentence similarity measure for building a Vietnamese multi-document summarization model. The thesis also presents a Vietnamese question answering system model that applies multi-document summarization and uses data from well-known search engines such as Google and Yahoo as background knowledge.
The experiments gave promising results, for example 89% rank accuracy for the All_2 measure on the Vietnamese clusters (Table 5.8) and 77.8% answer accuracy on web pages (Table 5.12). They support the choice and combination of methods and indicate considerable room for further development.

Future work
- Develop and extend the entity relation graph; study and build an entity topic hierarchy for the graph.
- Study and apply further algorithms for computing semantic similarity on semantic networks to improve the Vietnamese multi-document summarization model.
- Improve storage and indexing to speed up search and computation on the graph, and thereby speed up answering in the Vietnamese question answering model.
- Build and deploy a Vietnamese question answering system for end users.

Published scientific works and products
[VVU09] Vu Tran Mai, Vinh Nguyen Van, Uyen Pham Thu, Oanh Tran Thi and Thuy Quang Ha (2009). An Experimental Study of Vietnamese Question Answering System, International Conference on Asian Language Processing (IALP 2009): 152-155, Dec 7-9, 2009, Singapore.

[VUH08] Trần Mai Vũ, Phạm Thị Thu Uyên, Hoàng Minh Hiền, Hà Quang Thụy (2008). Semantic similarity between two sentences and its application to using multi-document summarization to evaluate the quality of data clustering on the VNSEN search engine (in Vietnamese), First Conference on Information and Communication Technology (ICTFIT08): 94-102, University of Science, Vietnam National University Ho Chi Minh City, Ho Chi Minh City, 2008.

Software products
[VTTV09] Trần Mai Vũ, Vũ Tiến Thành, Trần Đạo Thái, Nguyễn Đức Vinh (2009). Price search engine, http://vngia.com

References

Vietnamese
[MB09] Lương Chi Mai and Hồ Tú Bảo (2009). Final report of project KC.01.01/06-10 "Research and development of essential products for Vietnamese speech and text processing" and On Vietnamese language processing in information technology (2006), Institute of Information Technology, Vietnam Academy of Science and Technology, 2009.

English
[Ba07] Barry Schiffman (2007). Summarization for Q&A at Columbia University for DUC 2007, In Document Understanding Conference 2007 (DUC07), Rochester, NY, April 26-27, 2007.
[BE97] Regina Barzilay and Michael Elhadad. Using Lexical Chains for Text Summarization, In Advances in Automatic Text Summarization (Inderjeet Mani and Mark T. Maybury, editors): 111-121, The MIT Press, 1999.
[BKO07] Blake, C., Kampov, J., Orphanides, A., West, D., & Lown, C. (2007). UNCCH at DUC 2007: Query Expansion, Lexical Simplification, and Sentence Selection Strategies for Multi-Document Summarization, In DUC07.
[BL06] Blei, M. and Lafferty, J. (2006). Dynamic Topic Models, In the 23rd International Conference on Machine Learning, Pittsburgh, PA.
[BME02] Regina Barzilay, Noemie Elhadad, and Kathleen R. McKeown (2002). Inferring strategies for sentence ordering in multidocument news summarization, Journal of Artificial Intelligence Research: 35-55, 2002.
[BME99] Barzilay R., McKeown K., and Elhadad M. Information fusion in the context of multidocument summarization, Proceedings of the 37th annual meeting of the Association for Computational Linguistics: 550-557, New Brunswick, New Jersey, 1999.
[BMI06] D. Bollegala, Y. Matsuo, and M. Ishizuka (2006). Extracting key phrases to disambiguate personal names on the web, In CICLing 2006.
[CG98] Jaime Carbonell, Jade Goldstein (1998). The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries, In SIGIR-98, Melbourne, Australia, Aug. 1998.
[CSO01] John M Conroy, Judith D Schlesinger, Dianne P O'Leary, Mary Ellen Okurowski (2001). Using HMM and Logistic Regression to Generate Extract Summaries for DUC, In DUC 01, Natl Inst. of Standards and Technology, 2001.
[Ed69] H. Edmundson (1969). New methods in automatic abstracting, Journal of ACM, 16(2): 264-285, 1969.
[EWK] Website: http://en.wikipedia.org/wiki/Multi-document_summarization.
[FMN07] K. Filippova, M. Mieskes, V. Nastase, S. Paolo Ponzetto, M. Strube (2007). Cascaded Filtering for Topic-Driven Multi-Document Summarization, In EML Research gGmbH, 2007.
[GMC00] Jade Goldstein, Vibhu Mittal, Jaime Carbonell, Mark Kantrowitz (2000). Multi-Document Summarization By Sentence Extraction, 2000.
[HHM08] Phan Xuan Hieu, Susumu Horiguchi, Nguyen Le Minh (2008). Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections, In The 17th International World Wide Web Conference, 2008.
[HMR05] B. Hachey, G. Murray, D. Reitter (2005). Query-Oriented Multi-Document Summarization With a Very Large Latent Semantic Space, In The Embra System at DUC, 2005.
[Ji98] H. Jing (1998). Summary generation through intelligent cutting and pasting of the input document, Technical Report, Columbia University, 1998.
[KST02] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2002). Bleu: a method for automatic evaluation of machine translation, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL): 311-318, 2002.
[LH03] Chin-Yew Lin and Eduard Hovy (2003). Automatic evaluation of summaries using n-gram co-occurrence statistics, In Human Technology Conference 2003.
[LH97] Chin-Yew Lin and Eduard Hovy (1997). Identifying topics by position, Fifth Conference on Applied Natural Language Processing: 283-290, 1997.
[LLB06] Yuhua Li, David McLean, Zuhair Bandar, James O'Shea, Keeley A. Crockett (2006). Sentence Similarity Based on Semantic Nets and Corpus Statistics, IEEE Trans. Knowl. Data Eng. 18(8): 1138-1150.
[Lu58] H. Luhn (1958). The automatic creation of literature abstracts, IBM Journal of Research and Development, 2(2): 159-165, 1958.
[Ma01] Inderjeet Mani (2001). Automatic Summarization, John Benjamins Publishing Co., 2001.
[Mi04] Nguyen Le Minh (2004). Statistical Machine Learning Approaches to Cross Language Text Summarization, PhD Thesis, School of Information Science, Japan Advanced Institute of Science and Technology, September 2004.
[MM99] Inderjeet Mani and Mark T. Maybury (eds) (1999). Advances in Automatic Text Summarization, MIT Press, 1999, ISBN 0-262-13359-8.
[MR95] Kathleen R. McKeown and Dragomir R. Radev (1995). Generating summaries of multiple news articles, ACM Conference on Research and Development in Information Retrieval (SIGIR95): 74-82, Seattle, Washington, July 1995.
[PKC95] Jan O. Pedersen, Kupiec Julian and Francine Chen (1995). A trainable document summarizer, Research and Development in Information Retrieval: 68-73, 1995.
[PSM07] Ponzetto, Simone Paolo, and Michael Strube (2007). Knowledge Derived from Wikipedia For Computing Semantic Relatedness, Journal of Artificial Intelligence Research, 30: 181-212, 2007.
[Ra00] Dragomir Radev (2000). A common theory of information fusion from multiple text sources, step one: Cross-document structure, In 1st ACL SIGDIAL Workshop on Discourse and Dialogue, Hong Kong, October 2000.
[RFF05] Francisco J. Ribadas, Manuel Vilares Ferro, Jesús Vilares Ferro (2005). Semantic Similarity Between Sentences Through Approximate Tree Matching. IbPRIA (2): 638-646, 2005.
[RJS04] Dragomir R. Radev, Hongyan Jing, Malgorzata Stys, and Daniel Tam (2004). Centroid-based summarization of multiple documents, Information Processing and Management, 40: 919-938, December 2004.
[SD08] P. Senellart and V. D. Blondel (2008). Automatic discovery of similar words. Survey of Text Mining II: Clustering, Classification and Retrieval (M. W. Berry and M. Castellanos, editors): 25-44, Springer-Verlag, January 2008.
[Sen07] Pierre Senellart (2007). Understanding the Hidden Web, PhD thesis, Université Paris-Sud, Orsay, France, December 2007.
[SP06] Strube, M. & S. P. Ponzetto (2006). WikiRelate! Computing semantic relatedness using Wikipedia, In Proc. of AAAI-06, 2006.
[STP06] Krishna Sapkota, Laxman Thapa, Shailesh Bdr. Pandey (2006). Efficient Information Retrieval Using Measures of Semantic Similarity, Conference on Software, Knowledge, Information Management and Applications: 94-98, Chiang Mai, Thailand, December 2006.
[Su05] Sudarshan Lamkhede. Multi-document summarization using concept chain graphs, Master Thesis, Faculty of the Graduate School of the State University of New York at Buffalo, September 2005.
[Tu08] Nguyen Cam Tu (2008). Hidden Topic Discovery Toward Classification And Clustering In Vietnamese Web Documents, Master Thesis, Coltech of Technology, Viet Nam National University, Ha Noi, Viet Nam, 2008.
[VSB06] Lucy Vanderwende, Hisami Suzuki, Chris Brockett (2006). Task-Focused Summarization with Sentence Simplification and Lexical Expansion, Microsoft Research at DUC2006, 2006.
[WC07] R. Wang and W. Cohen (2007). Language-independent set expansion of named entities using the web, In ICDM07, 2007.
[YYL07] J.-C. Ying, S.-J. Yen, Y.-S. Lee, Y.-C. Wu, J.-C. Yang (2007). Language Model Passage Retrieval for Question-Oriented Multi Document Summarization, DUC 07, 2007.
[ZG07] T. Zesch and I. Gurevych (2007). Analysis of the Wikipedia Category Graph for NLP Applications, In Proc. of the TextGraphs-2 Workshop, NAACL-HLT, 2007.
[ZGM07] Torsten Zesch, Iryna Gurevych, and Max Muhlhauser (2007). Comparing Wikipedia and German Word-net by Evaluating Semantic Relatedness on Multiple Datasets, In Proceedings of NAACL-HLT, 2007.
