1. Introduction
Sinhalese or Sinhala (සිංහල, ISO 15919: siṁhala, pronounced [ˈsiŋɦələ], earlier referred to as
Singhalese) is the language of the Sinhalese, the largest ethnic group of Sri Lanka. It belongs to
the Indo-Aryan language family. Sinhala is spoken by about 19 million people in Sri Lanka,
about 16 million of whom are native speakers. It is one of the constitutionally recognized official
languages of Sri Lanka, along with Tamil. Sinhala has its own writing system, the Sinhala script,
which descends from the Indian Brahmi script. The oldest Sinhala inscriptions were written in
the 3rd and 2nd centuries BCE; the oldest existing literary works date from the 9th century CE.
The closest relative of Sinhala is Dhivehi, the language of the Maldives. Sinhala word order is
SOV (subject-object-verb), an order it shares with Japanese, Korean and many other languages
of Asia.[1]
Natural Language Processing (NLP) is the ability of a computer program to understand human
speech as it is spoken. NLP is a component of artificial intelligence (AI). The development of
NLP applications is challenging because computers traditionally require humans to speak to
them in a programming language that is precise, unambiguous and highly structured, or perhaps
through a limited number of clearly enunciated voice commands. Human speech, however, is not
always precise. It is often ambiguous, and its linguistic structure can depend on many complex
variables, including slang, regional dialects and social context. Current approaches to NLP are
based on machine learning, a type of artificial intelligence that examines and uses patterns in
data to improve a program's own understanding.[2]
This project aims to combine the Sinhala language with an NLP system so that this great
language survives for future generations. It combines a Sinhala speech-to-text system, a
Sinhala text-to-speech system and a Sinhala OCR system that works with Google Translate. The
second chapter of this paper (Background & Motivation) describes the present state of Sinhala
language processing. The third chapter (Problem in Brief) explains the problem. The fourth
chapter sets out the aims and objectives of the project. The proposed solution and its design
are then discussed with reference to relevant resources.
Dictionary. There is still no proper Sinhala language processing system built on a core of its
own for the Sinhala language. Such a system is needed if Sinhala is to survive in the IT world.
3. Problem in Brief
In the modern world, the Sinhala language has been left behind because few people try to do
anything new for it in the field of information technology. Languages such as English, French,
German and Japanese have built strong positions of their own in the IT world.
As a result, a Sinhala speech-to-text system has still not been developed. Such a system would
serve people with disabilities who cannot write.
Likewise, no properly accurate Sinhala text-to-speech (TTS) system is available yet. Some
Sinhala TTS systems have been developed, but they do not have a core of their own; they are
based on English dictionaries.
There is also still a need for a proper optical character recognition (OCR) system for Sinhala,
which would in turn help translate the recognized text into another language. These systems
would be very helpful for foreign tourists in Sri Lanka, who face a big problem translating the
local language into their own.
Intelligent applications such as Google Now, Cortana and Siri have not even begun to be
developed for Sinhala, because there is no proper API for Sinhala language processing.
II.
III. Sinhala text detection and translation into a given language.
5. Proposed Solution
This chapter proposes a software solution that fulfils the aim and overcomes the problems
described in the previous chapters. We are going to develop a desktop application for Sinhala
language processing. The proposed application will convert Sinhala speech to Sinhala text,
convert Sinhala text to speech (male/female voice), perform Sinhala optical character
recognition and translate Sinhala into a given language. We hope to use the Google Translate
service for translation.
Sinhala Text to Speech
Inputs: Sinhala text
Process: linguistic formalisms, inference engines, logical inferences, mathematical models,
algorithm computations
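Before any of the processing stages listed above can run, the input text has to be split into units that a synthesizer can look up. The following is a minimal sketch of one possible first step, not the method of this proposal: it assumes the front end works on a base letter together with the vowel signs and other marks attached to it. The function name `split_clusters` is hypothetical; only the standard `unicodedata` module is used.

```python
import unicodedata

# Zero-width non-joiner and zero-width joiner, used in Sinhala conjunct forms.
JOINERS = ("\u200c", "\u200d")

def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    Assumption for illustration: a character whose Unicode category starts
    with "M" (vowel signs, al-lakuna, anusvaraya) attaches to the previous
    character, and a character right after a zero-width joiner stays in the
    same cluster.
    """
    clusters = []
    for ch in text:
        attaches = unicodedata.category(ch).startswith("M") or ch in JOINERS
        follows_joiner = bool(clusters) and clusters[-1].endswith("\u200d")
        if clusters and (attaches or follows_joiner):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# The word "Sinhala" written in Sinhala script.
print(split_clusters("සිංහල"))
```

A real front end would still need pronunciation rules on top of these clusters; the sketch only shows why character-level handling of the script matters for a TTS core that does not depend on English dictionaries.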
[Figure 5.1 Sinhala STT system overview. Blocks in the diagram: speech input; filtering (time
algorithm + other algorithm); feature extraction; comparison with the speech database; convert
to Sinhala text; recognized speech.]
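The stages named in Figure 5.1 (filtering, feature extraction, comparison with a speech database) can be illustrated with a toy example. This is a sketch under stated assumptions, not the proposed implementation: the pre-emphasis filter, the two features (energy and zero-crossing rate), the nearest-match comparison, and the labels `word_a` and `word_b` are all chosen here for illustration, and synthetic tones stand in for recorded Sinhala speech.

```python
import math

def pre_emphasis(samples, alpha=0.97):
    """Filtering step: y[n] = x[n] - alpha * x[n-1] (expects a non-empty list)."""
    return [samples[0]] + [
        samples[i] - alpha * samples[i - 1] for i in range(1, len(samples))
    ]

def extract_features(samples):
    """Feature extraction: (mean energy, zero-crossing rate).

    The signal is scaled to unit peak first so that loudness does not
    change the features. Expects at least two samples.
    """
    peak = max(abs(s) for s in samples) or 1.0
    filtered = pre_emphasis([s / peak for s in samples])
    energy = sum(s * s for s in filtered) / len(filtered)
    crossings = sum(
        1 for a, b in zip(filtered, filtered[1:]) if (a < 0) != (b < 0)
    )
    return (energy, crossings / (len(filtered) - 1))

def recognize(samples, database):
    """Comparison step: return the label with the nearest stored features."""
    feats = extract_features(samples)
    return min(database, key=lambda label: math.dist(feats, database[label]))

def tone(cycles, n=200, amplitude=1.0):
    """Synthetic test signal standing in for a recorded word."""
    return [amplitude * math.sin(2 * math.pi * cycles * i / n) for i in range(n)]

# Hypothetical speech database: label -> stored feature vector.
database = {
    "word_a": extract_features(tone(5)),
    "word_b": extract_features(tone(40)),
}

# A quieter recording of the first signal still matches its template.
print(recognize(tone(5, amplitude=0.5), database))  # prints word_a
```

In a real system each label would be a Sinhala word or phoneme, the features would be richer (for example spectral features per frame), and the comparison would have to tolerate differences in speaking rate.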
Sinhala text detection (OCR) and its translation into a given language
The currently available Sinhala OCR systems support only one font style and one font size at a
time when text is processed by the OCR engine. The currently available Sinhala OCR engine
also cannot handle low-quality images with noise and other disturbances: accuracy drops
drastically and sometimes it does not produce any output at all. It is therefore necessary to
enhance the existing engine, or to build a new Sinhala OCR engine, so that it is capable of image
preprocessing and supports recognition of characters in multiple fonts and sizes. Our priority is
to fill that gap.
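The image preprocessing mentioned above can be illustrated with two standard techniques. This is a sketch, assuming a 3x3 median filter for noise removal and Otsu's method for binarization; the proposal itself does not name specific algorithms, and the function names and the small test image are made up for illustration. Images are plain lists of rows holding grey levels from 0 (black) to 255 (white).

```python
def median_filter(image):
    """3x3 median filter; border pixels use only the neighbours that exist."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = sorted(
                image[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            )
            row.append(window[len(window) // 2])
        out.append(row)
    return out

def otsu_threshold(image):
    """Return the grey level that maximises the between-class variance."""
    hist = [0] * 256
    for row in image:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    total_sum = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their grey levels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (total_sum - sum0) / w1
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(image):
    """Map dark pixels (ink) to 1 and light pixels (paper) to 0."""
    t = otsu_threshold(image)
    return [[1 if p <= t else 0 for p in row] for row in image]

# Made-up 5x6 image: dark area on the left, light paper on the right,
# with one dark noise speck in the light area.
noisy = [[20, 20, 20, 220, 220, 220] for _ in range(5)]
noisy[2][4] = 20
clean = binarize(median_filter(noisy))
for row in clean:
    print(row)
```

Note that a 3x3 median filter also erases strokes that are only one pixel wide, so on low-resolution scans of Sinhala text a gentler filter or a higher scan resolution may be needed.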
Google Translate now supports the Sinhala language, so it is easy to translate the detected
words into other languages. The detected words are also passed to the text-to-speech function
of our proposed solution.
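The hand-off of detected words to translation and to text to speech can be sketched as follows. This is an illustration only: `translate` and `speak` are hypothetical callables supplied by the caller and stand in for whatever translation service and TTS engine the application ends up using; no real Google Translate API call is shown. The only fixed fact relied on is that the Unicode Sinhala block spans U+0D80 to U+0DFF.

```python
SINHALA_BLOCK = range(0x0D80, 0x0E00)  # Unicode Sinhala block, U+0D80..U+0DFF
JOINERS = {"\u200c", "\u200d"}         # zero-width non-joiner / joiner

def is_sinhala(word):
    """True if the word contains Sinhala characters and nothing else but joiners."""
    has_letter = any(ord(ch) in SINHALA_BLOCK for ch in word)
    only_sinhala = all(ord(ch) in SINHALA_BLOCK or ch in JOINERS for ch in word)
    return has_letter and only_sinhala

def process_detected_words(words, translate, speak, target_language="en"):
    """Keep the Sinhala words from OCR output, speak them, and translate them.

    `translate(text, target_language)` and `speak(text)` are hypothetical
    callbacks; they are assumptions of this sketch, not real library calls.
    """
    text = " ".join(w for w in words if is_sinhala(w))
    speak(text)
    return translate(text, target_language)

# Usage with stand-in callbacks that only record what they receive.
spoken = []
result = process_detected_words(
    ["සිංහල", "abc", "12", "හ"],
    translate=lambda text, lang: (lang, text),
    speak=spoken.append,
)
print(result)
```

Filtering on the script before translation keeps stray Latin letters or digits produced by OCR errors out of the text that is spoken aloud.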
6. Design
The proposed application has three major functions: Sinhala speech to text, Sinhala text to
speech, and translation of OCR-identified Sinhala text into a given language. Initially, we hope
to develop it as a Windows desktop application.
The speech text interface provides methods to start, pause, resume, fast forward, rewind and stop
the TTS engine during speech. The attribute interface gives access to controls for the basic
behavior of the TTS engine.
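The control methods listed above can be sketched as a small state machine. This is a hypothetical design for illustration, not the interface of any existing TTS engine: the class name `SpeechController`, the chunk-based playback model, and the `rate` and `voice` attributes are assumptions (the male/female voice option comes from the proposed solution).

```python
class SpeechController:
    """Hypothetical playback controller over a list of text chunks."""

    def __init__(self, chunks, rate=1.0, voice="female"):
        self.chunks = list(chunks)
        # Attribute interface: basic behaviour of the engine.
        self.rate = rate
        self.voice = voice
        self.position = 0
        self.state = "stopped"

    def start(self):
        self.position = 0
        self.state = "speaking"

    def pause(self):
        if self.state == "speaking":
            self.state = "paused"

    def resume(self):
        if self.state == "paused":
            self.state = "speaking"

    def fast_forward(self, count=1):
        self.position = min(self.position + count, len(self.chunks))

    def rewind(self, count=1):
        self.position = max(self.position - count, 0)

    def stop(self):
        self.position = 0
        self.state = "stopped"

    def next_chunk(self):
        """Return the next chunk to synthesize, or None if nothing should play."""
        if self.state != "speaking":
            return None
        if self.position >= len(self.chunks):
            self.state = "stopped"
            return None
        chunk = self.chunks[self.position]
        self.position += 1
        return chunk

controller = SpeechController(["one", "two", "three", "four"], voice="male")
controller.start()
print(controller.next_chunk())  # prints one
controller.fast_forward()
print(controller.next_chunk())  # prints three
```

Keeping the playback position separate from the speaking state lets fast forward and rewind work while the engine is paused as well as while it is speaking.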
7. Resource Requirements
A good collection of Sinhala voice audio files is essential for carrying out the character
analysis and training the system properly.
8. References
[1] Sinhala alphabet, pronunciation and language - Omniglot
http://www.omniglot.com/writing/sinhala.htm
19th February 2015
[2] What is natural language processing (NLP)? Definition
http://searchcontentmanagement.techtarget.com/definition/natural-language-processing-NLP
19th February 2015
[3] Language Technology Research Laboratory - University of Colombo School of Computing
http://ucsc.cmb.ac.lk/ltrl/?page=home&lang=en&style=default
19th February 2015
[4] AMoRA Sinhala TTS - University of Moratuwa
http://lms.uom.lk/sf/shantha/Project-web-sites/2009-10/PI-58-AMoRA/index.html
19th February 2015
Appendix
Action Plan (Time Line for 2015)
Months: Feb, Mar, Apr, May, June, July, Aug, Sept, Oct, Nov
Tasks in chart order: 1, 2 & 3; 4; 5; 6; 7; 8; 9; 10
1. Study related projects and prepare the requirements needed to start the project.
2. Gather requirements for starting the project.
3. Design and plan the product.
4. Start the implementation.
5. Release a basic product with test cases for interim evaluation.
6. Start the second iteration by gathering complex requirements.
7. Design, plan and implement a complete product for publication.
8. Consider the user interface in depth and build it properly, taking HCI into account.
9. Finalize the project with test cases.
10. Prepare for final evaluation.