Thura Aung
BCIS ( Comp & Info Sys)
Outline
Abstract
1. Introduction
3. Purpose
7. Conclusion
8. Reference
Abstract
Myanmar should also have its own language search engine.
A Myanmar search engine should be on par with other search engines such as the Google Search Engine.
Introduction
Search engines for the Myanmar language are still being developed.
The names of earlier Myanmar search engines are as follows.
Sr. No. Website URL
1 http://www.searchmyanmar.com
2 http://www.myanmarcrawler.com
3 http://search.mymyanmar.net
4 http://www.etrademyanmar.com
5 http://dir.yahoo.com/Regional/Countries/Myanmar__Burma/_
6 http://www.dmoz.org/Regional/Asia/Myanmar
7 http://myanmar-myanmar.com
Purpose
• To describe the initial development of the Myanmar Search Engine.
• To analyze previous developers' ideas relating to search algorithms and word segmentation.
• To explore the many problems found during the implementation process.
Development history of Myanmar search engine
Date: October 02, 2007
Developers' Name: Tun Thura Thet; Jin-Cheon Na; Wunna Ko Ko
Title: Word segmentation for the Myanmar language

Sr. No. 2
Title: Tun Thura Thet; Jin-Cheon Na; Wunna Ko Ko, Word segmentation for the Myanmar language
Findings:
Syllable Segmentation:
-Considers a base character, a pre-base character, a post-base character, an above-base character and a below-base character.
-A dictionary-based statistical approach for syllable merging and a rule-based heuristic approach for syllable segmentation.
-99.05% recall, 98.94% precision and 98.99% F-measure.
Word Segmentation:
-Text is broken down into sentences and phrases by looking at punctuation marks and spaces.
-Search engines require documents to be indexed by words; the words of a query are compared with the indexed words of the documents.
Sr. No. 3
Title: Hla Hla Htay, Kavi Narayana Murthy, Myanmar Word Segmentation, ICCA 2006
Findings:
Word Segmentation:
-Myanmar sentences can be tokenized by eliminating stop words.
-Longest sentence matching and recognition of stop words lead to correct word segmentation.
-Two approaches to segmentation: one based on a list of stop words, the other using n-grams of syllables.
-Collected 1,216 stop words and 4,550 syllable words; 99% accuracy.
-Collected about 100,000 words.
-Achieved about 65% accuracy in word hypothesis.

Sr. No. 4
Title: Zin Maung Maung, Yoshiki Mikami, A Rule-based Syllable Segmentation of Myanmar Text
Findings:
Syllable Segmentation:
-Syllable segmentation algorithm based on the syllable structure of Myanmar script, with a rule-based approach to segmentation.
-The corpus contains a total of 32,238 Myanmar syllables.
-Accuracy rate of 99.96% for segmentation.
Word Segmentation:
-The program converts the input text string into an equivalent sequence of categories using CMCACV for Myanmar.
-Dictionary-dependent techniques of word segmentation.
-A segmentation program breaks between pairs of characters by comparing the input character sequence with tables.
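
The rule-based syllable segmentation described above can be illustrated with a short sketch. It is not the algorithm of either paper: it assumes the text is in standard Myanmar Unicode (not ZawGyi) and applies one simplified rule, namely that a syllable starts at a consonant (U+1000 to U+1021) unless that consonant is stacked under a virama (U+1039) or is closed by asat (U+103A) or a virama. The function name segment_syllables is hypothetical.

```python
import re

# A consonant starts a new syllable unless it is stacked (preceded by
# virama U+1039) or is a final consonant (followed by asat U+103A or
# virama U+1039). Simplified rule, assumed for illustration only.
SYLLABLE_START = re.compile(r"(?<!\u1039)[\u1000-\u1021](?![\u103a\u1039])")

def segment_syllables(text):
    """Split standard-Unicode Myanmar text into syllables."""
    starts = [m.start() for m in SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)          # keep any leading non-consonant text
    starts.append(len(text))
    return [text[a:b] for a, b in zip(starts, starts[1:]) if a < b]

# A three-syllable word, spelled with code points so the example does
# not depend on the font or encoding of this document.
word = "\u1019\u103c\u1014\u103a\u1019\u102c\u1005\u102c"
print(segment_syllables(word))
```

The rule is deliberately incomplete: independent vowels, digits, punctuation and non-canonical orderings of marks would need extra rules, which is where the syllable-structure tables of the cited work come in.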
Sr. No. 5
Title: Pann Yu Mon, Language Specific Crawler for Myanmar Web Pages
Findings:
-The purpose of the Language Specific Crawler (LSC) is to collect as many web pages as possible.
-Multi-threaded crawler that saves URLs in a CSV file.
-Saves page content in a Derby database.
-Crawler accuracy rate is 93%.
-Indexing is assigned to the documents of a corpus.
-Keywords are saved for web pages in the database.
-Myanmar language content needs Unicode; other encodings are transliterated to Myanmar Unicode.

Sr. No. 6
Title: Pann Yu Mon, Maung Maung Thant, Ohnmar Htun Pe, San Ko Oo, Yoshiki Mikami, Statistical Analysis of Myanmar Words on the World Wide Web for Search Engine Development, ICCA 2009
Findings:
-Collected Myanmar words from documents on websites to learn which words are frequently used; the research was based on ASCII format.
-Used the index words of a Myanmar-English dictionary.
Word Segmentation:
-Word segmentation program for Myanmar text based on the longest string matching algorithm.
-Identified a total of 766,892 Myanmar words (12,211 unique headwords).
-5,861 words (0.76%) were not identified. Accuracy is 99.24%.
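
The longest string matching approach mentioned above can be sketched as a greedy left-to-right scan over syllables. This is a minimal illustration under stated assumptions, not the authors' program: the function name, the max_len parameter and the romanized toy dictionary are all hypothetical, and the input is assumed to be already split into syllables.

```python
def longest_match_segment(syllables, dictionary, max_len=6):
    """Greedy left-to-right longest matching over a list of syllables."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            # A single syllable is always accepted, so unknown text
            # is passed through instead of stalling the scan.
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words

# Romanized stand-ins for Myanmar syllables (hypothetical dictionary).
dictionary = {"mjanma", "mjanmasa", "sapei"}
syllables = ["mjan", "ma", "sa", "pei"]
print(longest_match_segment(syllables, dictionary))
# -> ['mjanmasa', 'pei']
```

The example also shows the known weakness of greedy matching: it picks "mjanmasa" and leaves "pei" unidentified, although "mjanma" followed by "sapei" would have covered the whole input with dictionary words.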
Sr. No. 7
Title: Myanmar Search Engine and Myanmar Web Directory Website, developed by Myanmar .Net Search Engine
Findings:
-Development of a Myanmar search engine based on the Google API (Application Programming Interface).
-Myanmar-English bilingual search engine.
-For a web search query typed in the ZawGyi font, the search engine returns web pages written in the ZawGyi font.
-For a query typed in Myanmar 3 or another Myanmar Unicode font, the search engine returns web pages written in Myanmar 3 or other Unicode fonts.
-Details can be found at http://myanmar-myanmar.com.
Problems occurring in the Myanmar Search Engine
Problems or requirements, and how to overcome them
• Some web sites use different non-standard Unicode encodings.
  To overcome: a converting program is required.
• Development of a web crawler.
  To overcome: the web crawler program needs to crawl dynamic Myanmar web pages that are written in a variety of Myanmar fonts, and to show the crawling depth or level of the website.
• Indexing technique for Myanmar words.
  To overcome: Lucene indexing based on an analyzer, choosing an indexing method for Myanmar pages (e.g. a first-character index structure).
• Link ranking system for the search engine.
  To overcome: a Lucene scoring system for Myanmar pages needs to be developed.
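
The crawler requirement above (follow links, record the crawling depth, keep only Myanmar pages) can be sketched as a breadth-first crawl. This is a single-threaded sketch under stated assumptions, not the Language Specific Crawler itself: fetch is a hypothetical caller-supplied function, and a page is judged to be Myanmar simply by the share of its characters in the Myanmar block U+1000 to U+109F. ZawGyi text reuses the same code points, so this test does not distinguish ZawGyi from standard Unicode.

```python
from collections import deque

def myanmar_ratio(text):
    """Share of non-space characters in the Myanmar block U+1000..U+109F."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u1000" <= c <= "\u109f" for c in chars) / len(chars)

def crawl(seed, fetch, max_depth=2, min_ratio=0.5):
    """Breadth-first crawl; returns {url: depth} for pages judged Myanmar.

    fetch(url) is a caller-supplied (hypothetical) function returning
    (page_text, list_of_links).
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    kept = {}
    while queue:
        url, depth = queue.popleft()
        text, links = fetch(url)
        if myanmar_ratio(text) >= min_ratio:
            kept[url] = depth
        if depth < max_depth:            # stop expanding at the depth limit
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return kept

# In-memory stand-in for the web, so the sketch runs without a network.
web = {
    "a": ("\u1019\u103c\u1014\u103a\u1019\u102c", ["b", "c"]),
    "b": ("english only", ["d"]),
    "c": ("\u1005\u102c", []),
    "d": ("\u1019\u102c", []),
}
print(crawl("a", lambda url: web[url], max_depth=1))
```

A real crawler would also need HTTP fetching, politeness rules, and the multi-threading and CSV or database storage described in the development history.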
[Figure: Lucene API. Labels: DB, Application, Index Documents (Field), Index, Search Index (Query, Hits).]
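
The index, query and hits flow named in the figure can be illustrated with a toy inverted index. This is not the Lucene API; the class TinyIndex and its methods are hypothetical and only show where an analyzer plugs in. Whitespace splitting is used as the analyzer here, which works for the English example but not for Myanmar text, where a word segmentation step would take its place.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: token -> set of document ids."""

    def __init__(self, analyzer):
        self.analyzer = analyzer          # callable: text -> list of tokens
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index one document under every token the analyzer produces."""
        for token in self.analyzer(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query token."""
        tokens = self.analyzer(query)
        if not tokens:
            return set()
        hits = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            hits &= self.postings.get(token, set())
        return hits

index = TinyIndex(analyzer=str.split)
index.add(1, "myanmar search engine")
index.add(2, "myanmar web directory")
print(index.search("myanmar search"))   # -> {1}
```

The same analyzer is applied to documents and to queries, which is why the choice of segmentation method affects both indexing and search.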
Analyzers in Lucene
[Figure: Input query, checking of syntax errors, analyzer steps (1. Word segmentation, 2. Part-of-speech, 3. Stemming, ...), Lucene search engine, search result.]
The main functions of the system are stop word removal, syllable breaking and word break rules.
It works as an analyzer for Lucene, between the Unicode font and the operating system, for search on the WWW.
Tokenization or word segmentation for the indexing process
[Figure: Original words go through a Myanmar stemming program, together with synonyms and antonyms, alternative words and technical terms, to produce tokens for the search engine.]
Unicode Converting Program
[Figure: Web pages in previous-version Unicode fonts (e.g. v.4.0) and web pages in non-Unicode fonts go through the converting program to Unicode standard encoding, then to crawling and indexing.]
Myanmar Unicode font and Input Method Editor
• Mozilla Firefox: full support, language pack available (ver 1.4)
• Linux: the Linux Myanmar Pango Module font name is Masterpiece UniSan.
Analysis of Myanmar Words
1. Stemming words or newly derived words, for example big (ၾကိးေသာ) [kji], bigger (ပိုၾကီးေသာ) [pou kji], biggest (အၾကီးဆံုး) [a kji hsoun]
3. Synonyms and antonyms, for example big (ၾကီးသည္) [a kji], small (ေသးငယ္သည္) [a tha]
4. Technical terms, for example a traditional medicine name: cough tablet (ေခ်ာင္းဆိုးေပ်ာင္ေဆး) [chaun: hsou: hsei:]; a traditional product name: mohinga (မုန္႔ဟင္းခါး) [mou. hin: ga:]; a traditional name for an engineering term: tape measure (ေပၾကိဳး) [pei gjou:]
5. Combined words, for example "cook rice" is not simply the combination of "cook" and "rice" (ထမင္းခ်က္ျခင္းသည္ ခ်က္ျပဳတ္ျခင္း ႏွင့္ ထမင္း ေပါင္းထားျခင္း မဟုတ္)
6. Loan words, for example computer (ကြန္ပ်ဴတာ) [kun pju ta], sub-committee (ဆပ္ေကာ္မတီ) [hsa _ ko ma ti], cherry (ခ်ယ္ရီ) [che ri], bureaucracy (ဗ်ဴရိုကေရစီ) [bju rou karei si], order (ေအာ္ဒါ) [o da], opera (ေအာ္ပရာ) [o para]
7. Lemmatized words, for example play (ကစားသည္) [gaza], will play (ကစားလိမ့္မည္) [gaza mji], play ground (ကစားကြင္း) [gaza: gwin], game (ကစားပြဲ) [gaza pwe]
List of possible stop words
No. | Part of Speech | Example
1 | Subject personal pronouns | I (ကၽြန္ေတာ္) [kja no], we (ကၽြႏု္တို႔) [kja no do], he (သူ) [thu], she (သူမ) [thu ma], it (ထိုအရာ) [htou]
2 | Object personal pronouns | Me (ကၽြန္ေတာ္) [kjanou’ ko], us (ကၽြႏုပိတို႔ကို) [thu tou ko], him (သူကို) [thu ko], her (သူမကို) [thu ma ko]
3 | Possessive pronouns and adjectives | Mine (က်ေနာ္၏) [kjanou’ i.], your (သူ၏) [thu i.], his (သူ၏) [thu i.], her (သူမ၏) [thu ma i.]
6 | Indefinite pronouns and adjectives | Some (အခ်ိဳ႕) [a.chou], few (အနည္းငယ္) [a ne: nge], none (စိုးစဥ္မွ်)
7 | Demonstrative pronouns and adjectives | This (ဤအရာ) [i. ha], that (ထိုအရာ) [htou ha], these (ေဟာဒီ), those (ဟိုဟာ)
8 | Interrogative pronouns and questions | Who (မည္သူ) [be thu], when (ဘယ္ေနရာ) [be], how (ဘယ္လို) [be lou], what (ဘယ္လဲ) [be le]
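
A stop word list like the one above can be used to cut a sentence into candidate chunks, as in the stop-word approach to segmentation. The following is a minimal sketch under stated assumptions: the input is already a list of syllables, stop words are stored as tuples of syllables, the romanized syllables stand in for Myanmar text, and the function name split_on_stop_words is hypothetical.

```python
def split_on_stop_words(syllables, stop_words):
    """Drop stop words (tuples of syllables); return the chunks between them."""
    max_len = max((len(s) for s in stop_words), default=0)
    chunks, current, i = [], [], 0
    while i < len(syllables):
        # Prefer the longest stop word, so "thu ma" wins over "thu".
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            if tuple(syllables[i:i + n]) in stop_words:
                if current:
                    chunks.append(current)
                    current = []
                i += n
                break
        else:
            # No stop word starts here: the syllable belongs to a chunk.
            current.append(syllables[i])
            i += 1
    if current:
        chunks.append(current)
    return chunks

# Romanized stand-ins taken from the examples in this document.
stop_words = {("thu",), ("thu", "ma")}
sentence = ["thu", "ma", "gaza", "pwe", "thu", "gaza"]
print(split_on_stop_words(sentence, stop_words))
# -> [['gaza', 'pwe'], ['gaza']]
```

Each remaining chunk is shorter than the original sentence, which narrows the search for word boundaries in a later dictionary or n-gram step.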
Maintenance and Updating
Every language gains new words (slang) that enter daily usage, so the word lists need regular maintenance and updating.
3) Word segmentation for the Myanmar language, Tun Thura Thet; Jin-Cheon
Na; Wunna Ko Ko at:
4) Statistical Analysis of Myanmar Words on the World Wide Web for Search
Engine Development, Pann Yu Mon; Maung Maung Thant; Ohnmar Htun Pe;
San Ko Oo; Yoshiki Mikami, Management and Information Systems
Engineering Department, Nagaoka University of Technology, International
University of Japan